Usability testing of a healthcare chatbot: Can we use conventional methods to assess conversational user interfaces?

Samuel Holmes, Ulster University, School of Communication & Media, Newtownabbey, UK, [email protected]
Anne Moorhead, Ulster University, School of Communication & Media / Institute of Nursing & Health Research, Newtownabbey, UK, [email protected]
Raymond Bond, Ulster University, School of Computing, Newtownabbey, UK, [email protected]
Huiru Zheng, Ulster University, School of Computing, Newtownabbey, UK, [email protected]
Vivien Coates, Ulster University, Institute of Nursing & Health Research, Coleraine, UK, [email protected]
Michael McTear, Ulster University, School of Computing, Newtownabbey, UK, [email protected]

ABSTRACT
Chatbots are becoming increasingly popular as a human-computer interface. The traditional best practices normally applied to User Experience (UX) design cannot easily be applied to chatbots, nor can conventional usability testing techniques guarantee accuracy. WeightMentor is a bespoke self-help motivational tool for weight loss maintenance. This study addresses the following four research questions: How usable is the WeightMentor chatbot, according to conventional usability methods? To what extent do different conventional usability questionnaires correlate when evaluating chatbot usability, and how do they correlate to a tailored chatbot usability survey score? What is the optimum number of users required to identify chatbot usability issues? How many task repetitions are required for first-time chatbot users to reach optimum task performance (i.e. efficiency based on task completion times)? This paper describes the procedure for testing the WeightMentor chatbot, assesses correlation between typical usability testing metrics, and suggests that conventional wisdom on participant numbers for identifying usability issues may not apply to chatbots. The study design was a usability study. WeightMentor was tested using a pre-determined usability testing protocol, evaluating ease of task completion, unique usability errors and participant opinions on the chatbot (collected using usability questionnaires). WeightMentor usability scores were generally high, and correlation between questionnaires was strong. The optimum number of users for identifying chatbot usability errors was 26, which challenges previous research. Chatbot users reached optimum proficiency in tasks after just one repetition. Usability test outcomes confirm what is already known about chatbots - that they are highly usable (due to their simple interface and conversation-driven functionality) - but conventional methods for assessing usability and user experience may not be as accurate when applied to chatbots.
CCS CONCEPTS
• Human-centered computing → Usability testing; • Software and its engineering → Software testing and debugging.

KEYWORDS
Usability Testing, Chatbots, Conversational UI, UX Testing

ACM Reference Format:
Samuel Holmes, Anne Moorhead, Raymond Bond, Huiru Zheng, Vivien Coates, and Michael McTear. 2019. Usability testing of a healthcare chatbot: Can we use conventional methods to assess conversational user interfaces? In ECCE '19: European Conference on Cognitive Ergonomics, September 10-13, 2019, Belfast, UK. ACM, New York, NY, USA, 8 pages.

1 INTRODUCTION
Chatbots are becoming increasingly popular as a human-computer interface. The year 2016 was described as "The rise of the chatbot" [20], and major companies including Microsoft, Google, Amazon and Apple have all developed and deployed their own "personal digital assistants" or "smart speakers", which are platforms for chatbots (also known as voicebots). Interacting with a chatbot is arguably more natural and intuitive than conventional methods of human-computer interaction, given that it resembles human-human interaction. Moreover, because chatbots integrate with popular social media platforms such as Facebook Messenger and Skype, users are not required to learn new, unfamiliar interfaces or even download an app.



1.1 Chatbot UX Design
Cameron et al. (2018) suggested that the chatbot development life-cycle is different from traditional development life-cycles [7]. For example, where conventional user interface design may focus on user interface prototyping, chatbot design instead focuses on conversation modelling, and the interface may be improved by analysing interaction logs to determine how best to improve conversation flow [7]. Best practices for User Experience (UX) design, such as Shneiderman's Eight Golden Rules [27] and Nielsen's Ten Usability Heuristics [18], cannot be easily applied to chatbots, which must instead replicate human conversation. Moore et al. (2017) identify three basic principles of conversation design and four main conversation types, summarized in Tables 1 and 2 [17].

Table 1: Basic conversation design principles (Moore et al. 2017)

Principle         Description
Recipient Design  Tailoring conversation to match the user's level of understanding
Minimization      Keeping interactions as short and simple as possible
Repair            Recovering from failures and helping understanding (e.g. repeating/rephrasing)

Table 2: Main conversation types (Moore et al. 2017)

Conversation Type         Description
Ordinary Conversation     Casual and unrestrained (e.g. small talk)
Service Conversation      Constrained by rules and roles (e.g. customer service agent/customer)
Teaching Conversation     Happens between teacher and student; the teacher probes to test knowledge
Counselling Conversation  The counselee leads the conversation, seeking advice from the counsellor

1.2 Testing Usability of Chatbots
Usability tests are generally conducted according to standard practices, using standardized tools. However, in some cases it is necessary to modify test protocols to account for characteristics of participants or technologies. Gibson et al. (2016) described a usability test of an app designed for aiding reminiscence in people living with dementia; in this case it was necessary to favour usability testing metrics such as task completion over others such as concurrent think-aloud and usability questionnaires [12]. It may be necessary to similarly modify traditional usability testing methods when testing chatbots. UX247 describe four main issues that may be encountered by users of a traditional system such as a website: Language, Branding, Functions and Information Retrieval [29]. If the language used on a website is too complex, a user may struggle to understand it; chatbot users may face the same issue, as chatbots are conversation-based. Branding in websites and software is generally visual, using recognizable graphics and colours; chatbots, on the other hand, are conversation driven, thus the conversation and tone of voice need to reflect the brand. If the functions of a website or system are poorly designed, this will reduce the usability of the site; in conversation-based systems, functions may be considered equivalent to conversations and conversation flow, which, if poorly designed, will also affect usability. Finally, information retrieval must be accurate. In a web-based system, for example, a poorly designed search function may result in incorrect information being returned to the user; in chatbot terms, this may happen if the chatbot incorrectly interprets what the user says or misunderstands their question [29].

A 2018 study by the Nielsen Norman Group suggested that several aspects of chatbots should be tested in order to validate the UX [5]. These include interaction style (e.g. buttons and links vs. text entry), conversation flow, language and privacy. Nevertheless, it is evident that testing the usability of chatbots might require new methods beyond the conventional usability engineering instruments, since chatbots offer a very different kind of human-computer interaction. Furthermore, given that usability validation of medical devices and healthcare software is often a requirement for FDA approval, measuring the usability of a healthcare-focused chatbot is an important research topic.

1.3 Chatbots in Healthcare
Whilst the research is primitive, chatbots have been shown to be of use as therapeutic healthcare interventions, or at least for augmenting traditional healthcare interventions. Barak et al. (2009) reported that the ability of chatbots to create empathy and react to emotions resulted in higher compliance with therapeutic treatments [3]. Healthcare-focused chatbots facilitate increased user engagement and increased usability [10], and may also solve issues with text message-based systems, such as 24-hour availability and automated messages sounding less human [9]. In response to growing waiting lists and difficulties in accessing mental health services, Cameron et al. (2018) developed iHelpr, a self-help mental health chatbot. It is suggested that conversational interfaces may soon replace web pages and smartphone apps as the preferred means of conducting online tasks [8]. Chatbots are suggested to be of use in the area of mental health because they provide instant access to help and support, and increased efficiency [6].

1.4 The WeightMentor Chatbot
In this paper we describe a study that assessed the usability of a healthcare-focused chatbot called WeightMentor, which we have developed at Ulster University. This chatbot is a bespoke self-help motivational tool that supports weight loss maintenance by encouraging self-reporting, personalized feedback and motivational dialogues [13]. Self-reporting, personalized feedback and motivation have been shown to be beneficial for weight loss maintenance in the short term [9] [11]. This paper involves the usability testing of the WeightMentor chatbot and uses this experiment as a case study to help answer several key research questions in this field.


2 RESEARCH QUESTIONS
(1) How usable is the WeightMentor chatbot, according to conventional usability methods?
(2) To what extent do different conventional usability questionnaires correlate when evaluating chatbot usability? And how do they correlate to a tailored chatbot usability survey score?
(3) What is the optimum number of users required to identify chatbot usability issues?
(4) How many task repetitions are required for first-time chatbot users to reach optimum task performance (i.e. efficiency based on task completion times)?

Question 3 is of interest as knowing how many subjects to recruit to identify most of the usability issues in a chatbot is important for planning usability studies in the UX industry. Moreover, it is also important given that studies have reported that most usability issues of traditional human-computer systems can be identified when recruiting 5-8 subjects [19]. However, this may not hold true for chatbot interfaces.

3 METHODS
3.1 Research Design
This usability test was an observational research design, testing the usability of the chatbot and comparing a novel usability questionnaire specifically designed for chatbot testing with existing questionnaires. Ethical approval for this study was obtained from the School of Communication & Media Filter (Research Ethics) Committee, Ulster University, in October 2019.

3.2 Chatbot Development
The WeightMentor chatbot was developed for Facebook Messenger, as Facebook is currently the most popular social media tool on smartphones and mobile devices [21] [22]. The chatbot conversation flow was designed using DialogFlow, a framework integrating Google's machine learning and natural language understanding capabilities. The basic chatbot functionality was supplemented using a custom-built NodeJS app hosted on Heroku.
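The fulfilment logic for such an agent is a webhook that receives the intent matched by DialogFlow, reads any extracted parameters, and returns the next reply in the conversation. The paper's implementation was a NodeJS app on Heroku; the sketch below shows the same pattern in Python/Flask purely as an illustration of the architecture, and the "ReportWeight" intent and its "weight" parameter are hypothetical, not taken from WeightMentor.

```python
# Illustrative webhook sketch (the actual fulfilment app was NodeJS on Heroku).
# Intent and parameter names below are hypothetical.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    body = request.get_json(force=True)
    # Dialogflow (ES, v2) sends the matched intent and parameters in queryResult
    intent = body["queryResult"]["intent"]["displayName"]
    params = body["queryResult"].get("parameters", {})

    if intent == "ReportWeight":  # hypothetical intent name
        reply = f"Thanks for checking in! I've recorded your weight as {params.get('weight')} kg."
    else:
        reply = "Sorry, I didn't catch that. Could you rephrase?"

    # The agent speaks whatever is returned in fulfillmentText
    return jsonify({"fulfillmentText": reply})

if __name__ == "__main__":
    app.run(port=8080)
```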

3.3 Test Protocol
Figure 1 shows a photograph of a typical usability testing setup. The MOD1000 camera records the user's hand movements as they interact with the chatbot. The laptop runs software to capture video from the MOD1000, and audio as part of the concurrent think-aloud aspect of the usability test.

Usability tests were conducted as follows:

(1) Participant signed the consent form.
(2) Participant was given access to WeightMentor.
(3) Test coordinator read a briefing to the participant.
(4) Participant completed a pre-test (demographic) questionnaire.
(5) Test coordinator read each task to the participant.
(6) Participant confirmed their understanding of the task.
(7) Participant answered the pre-task Single Ease Question ("On a scale of 1 to 7, how easy do you think it will be to complete this task?").
(8) Participant was video and audio recorded completing the task. The participant talked through what they could see on screen and what they were trying to do during the task (known as the concurrent think-aloud protocol).
(9) Participant answered the post-task Single Ease Question ("On a scale of 1 to 7, how easy was it to complete this task?").
(10) After all tasks, each participant completed post-test usability surveys, including the System Usability Scale (SUS) survey [4], the User Experience Questionnaire (UEQ) [15], and a new Chatbot Usability Questionnaire (CUQ) that we developed for this study (see Table 4).

Figure 1: Typical usability testing setup

3.4 Usability Metrics
Baki Kocaballi et al. (2018) suggested that multiple metrics may be more appropriate for measuring chatbot usability [14]. Thus three metrics were selected to evaluate the usability of WeightMentor: the SUS score, the UEQ metrics, and our own CUQ score (although the CUQ is yet to be validated, it has been designed to measure the usability of chatbots, as opposed to the general usability of human-computer systems). In addition, we measured task completion times, usability issues, and the differential between pre-task SEQ answers and post-task SEQ answers.

3.4.1 System Usability Scale (SUS). SUS was designed as a quick and easy means of assessing usability [4] and today is one of the most commonly used usability testing tools. SUS comprises ten validated statements, covering five positive and five negative aspects of the system. Participants score each question out of five. Final scores are out of 100 and may be compared with the SUS benchmark (currently 68.0, representing an average score) or interpreted using several possible scales.

SUS scores can be grouped into percentile ranges [25]. Alternatively, a grade system may be used. Bangor et al. (2009) propose an adjective-based scale ranging from "Worst Imaginable" (SUS grades 0 - 25) to "Best Imaginable" (SUS grades over 84.1) [2]. The Acceptability Scale [1] ranges from "Not Acceptable" (SUS < 62.6) to "Acceptable" (SUS > 71.1). Finally, SUS scores may be linked with the Net Promoter Score (NPS), designed for measuring customer loyalty [23]: SUS scores greater than 78.8 may be classed as "Promoters", scores below 62.7 as "Detractors", and scores in between as "Passive" [24].

The benchmark SUS score falls within the 41st - 59th percentile. It is grade C, "Good" on the adjective scale, "Marginal" on the acceptability scale, and NPS classification "Passive".
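Because these interpretation scales are simple threshold rules, they are easy to apply programmatically. Below is a minimal sketch using only the cut-offs quoted above from [1], [2] and [24]; the intermediate grades of the adjective scale are omitted, and how to treat the gaps between published cut-offs is a judgment call.

```python
# Map a 0-100 SUS score onto the interpretation scales described above.
def interpret_sus(score: float) -> dict:
    if score > 71.1:
        acceptability = "Acceptable"
    elif score < 62.6:
        acceptability = "Not Acceptable"
    else:
        acceptability = "Marginal"

    if score > 78.8:
        nps = "Promoter"
    elif score < 62.7:
        nps = "Detractor"
    else:
        nps = "Passive"

    return {
        "acceptability": acceptability,
        "nps": nps,
        "above_benchmark": score > 68.0,  # current SUS benchmark (average score)
    }

print(interpret_sus(84.83))  # the WeightMentor mean: Acceptable, Promoter, above benchmark
```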

3.4.2 User Experience Questionnaire (UEQ). The UEQ serves as a means of comprehensively assessing the UX [15]. It is based on six scales, summarised in Table 3. Scales are measured using pairs of opposite adjectives to describe the system, with participants selecting their level of agreement with each. UEQ scores assess the extent to which the system meets expectations but, more usefully, may be compared with a benchmark to determine how the system under test compares to other systems.

Table 3: UEQ Scales

Scale           What does it measure?
Attractiveness  Extent to which users like the system
Perspicuity     Ease of learning and becoming proficient in the system
Efficiency      Effort required to complete tasks; system reaction times
Dependability   Extent of user control; system predictability/security
Stimulation     How fun/exciting is it to use the system
Novelty         Creativeness/interest for users

3.4.3 Chatbot Usability Questionnaire (CUQ). The CUQ is based on the chatbot UX principles provided by the ALMA Chatbot Test tool [16], which assesses the personality, onboarding, navigation, understanding, responses, error handling and intelligence of a chatbot. The CUQ is designed to be comparable to SUS, except it is bespoke for chatbots and includes 16 items. Participants' levels of agreement with sixteen statements relating to positive and negative aspects of the chatbot are ranked out of five, from "Strongly Disagree" to "Strongly Agree". Statements used in the CUQ are listed in Table 4.

Table 4: Statements used in the novel and bespoke, but 'unvalidated', Chatbot Usability Questionnaire (CUQ)

No.  Statement
1    The chatbot's personality was realistic and engaging
2    The chatbot seemed too robotic
3    The chatbot was welcoming during initial setup
4    The chatbot seemed very unfriendly
5    The chatbot explained its scope and purpose well
6    The chatbot gave no indication as to its purpose
7    The chatbot was easy to navigate
8    It would be easy to get confused when using the chatbot
9    The chatbot understood me well
10   The chatbot failed to recognise a lot of my inputs
11   Chatbot responses were useful, appropriate and informative
12   Chatbot responses were not relevant
13   The chatbot coped well with any errors or mistakes
14   The chatbot seemed unable to handle any errors
15   The chatbot was very easy to use
16   The chatbot was very complex

3.5 Questionnaire Analysis
3.5.1 SUS Calculation. The SUS score calculation spreadsheet [28] was used to calculate SUS scores out of 100. The formula for this calculation is shown in equation (1):

$$\mathrm{SUS} = \frac{1}{n}\sum_{i=1}^{n}\left(\mathit{norm}\cdot\sum_{j=1}^{m}\begin{cases}q_{ij}-1, & j \bmod 2 > 0\\5-q_{ij}, & \text{otherwise}\end{cases}\right) \qquad (1)$$

where n = number of subjects (questionnaires), m = 10 (number of questions), q_ij = individual score per question per participant, and norm = 2.5.
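Equation (1) mirrors the standard SUS scoring procedure: odd-numbered (positive) items contribute their score minus one, even-numbered (negative) items contribute five minus their score, and the total is scaled by 2.5. A minimal sketch, assuming responses are coded 1-5:

```python
# Compute SUS scores (0-100) from raw 1-5 responses, per equation (1).
def sus_scores(responses: list[list[int]]) -> list[float]:
    """responses[i][j] is participant i's answer to question j+1 (10 questions)."""
    scores = []
    for answers in responses:
        assert len(answers) == 10
        contributions = [
            (q - 1) if (j + 1) % 2 > 0 else (5 - q)  # odd items: q-1; even items: 5-q
            for j, q in enumerate(answers)
        ]
        scores.append(2.5 * sum(contributions))  # norm = 2.5 maps 0-40 onto 0-100
    return scores

# A participant who strongly agrees with every positive item and strongly
# disagrees with every negative item scores the maximum:
print(sus_scores([[5, 1, 5, 1, 5, 1, 5, 1, 5, 1]]))  # [100.0]
```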

3.5.2 UEQ Calculation. The UEQ Data Analysis Tool [26] analyses questionnaire data and presents results graphically. By default, the UEQ does not generate a single score for each participant, but instead provides six scores, one for each attribute. To facilitate correlation analysis, a "mean UEQ score" - the mean of a participant's scores across all six UEQ scales - was calculated for each participant.

3.5.3 CUQ Calculation. CUQ scores were calculated out of 160 using the formula in equation (2), and then normalized to give a score out of 100 to permit comparison with SUS:

$$\mathrm{CUQ} = \left[\left(\sum_{n=1}^{m/2} q_{2n-1} - 8\right) + \left(40 - \sum_{n=1}^{m/2} q_{2n}\right)\right] \times 2.5 \qquad (2)$$

where m = 16 (number of questions), q_{2n-1} = the participant's scores for the odd-numbered (positive) statements, and q_{2n} = the participant's scores for the even-numbered (negative) statements, each scored out of five.
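A minimal sketch of this scoring, under the same assumptions as equation (2) (odd-numbered statements positive, even-numbered negative, each scored 1-5); dividing the 160-point score by 1.6 gives the normalized 0-100 score:

```python
# Compute a CUQ score from the sixteen 1-5 responses, per equation (2).
def cuq_score(answers: list[int]) -> float:
    """answers[j] is the response to statement j+1. Returns a 0-100 score."""
    assert len(answers) == 16
    odd_total = sum(answers[0::2])   # positive statements 1, 3, ..., 15
    even_total = sum(answers[1::2])  # negative statements 2, 4, ..., 16
    raw = ((odd_total - 8) + (40 - even_total)) * 2.5  # score out of 160
    return raw / 1.6  # normalize to 0-100 for comparison with SUS

# Strong agreement with all positive statements and strong disagreement with
# all negative statements gives the maximum score:
print(cuq_score([5, 1] * 8))  # 100.0
```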


3.6 Data Analysis & Presentation Tools
R Studio and Microsoft Excel were used for data analysis. Box plots were produced using R Studio, and bar/line plots were produced using Excel. Statistical tests such as correlation tests and t-tests were conducted in R Studio. Correlation analysis was conducted using the Pearson correlation coefficient. Mean task completion times were compared for significance using paired Wilcoxon analysis. The p-value threshold for statistical significance was 5% (or 0.05).
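For illustration, the two named tests look as follows in Python/SciPy (the authors used R Studio; the arrays here are toy placeholder data, not study results):

```python
import numpy as np
from scipy.stats import pearsonr, wilcoxon

# Hypothetical per-participant questionnaire scores
sus = np.array([85.0, 90.0, 72.5, 95.0, 80.0])
cuq = np.array([78.1, 84.4, 70.3, 92.2, 76.6])

r, p = pearsonr(sus, cuq)  # Pearson correlation coefficient
print(f"r = {r:.3f}, p = {p:.4f}, significant at 0.05: {p < 0.05}")

# Hypothetical task completion times (seconds) for two attempts at the same task
attempt1 = np.array([62.0, 55.0, 71.0, 48.0, 66.0])
attempt2 = np.array([41.0, 39.0, 52.0, 35.0, 50.0])
stat, p = wilcoxon(attempt1, attempt2)  # paired Wilcoxon signed-rank test
print(f"W = {stat}, p = {p:.4f}")
```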

4 RESULTS
4.1 WeightMentor Chatbot Usability
4.1.1 System Usability Scale. A total of 30 participants (healthy adults) were recruited, who evaluated the usability of the WeightMentor chatbot. A boxplot of WeightMentor SUS scores is shown in Figure 2. The mean WeightMentor SUS score was 84.83 ± 12.03 and the median was 86.25. The highest score was 100.0 and the lowest was 57.50. The mean WeightMentor score places the chatbot in the 96th-100th percentile range, equivalent to Grade A+, "Best Imaginable", "Acceptable", and NPS class "Promoter" on the various scales discussed in the methods. Hence the chatbot has a high degree of usability according to traditional SUS scores. However, the SUS distributions used for benchmarking have not included usability scores from testing chatbots.

Figure 2: WeightMentor SUS scores (benchmark of 68.0 is marked)

4.1.2 User Experience Questionnaire. The chatbot scored highly in all UEQ scales. Scores were well above benchmark and are presented graphically in Figure 3. Participant scores for each scale were all above +0.8, suggesting that, in general, participants were satisfied with the WeightMentor user experience.

Figure 3: WeightMentor UEQ scores against benchmark

4.1.3 Chatbot Usability Questionnaire. A box plot of WeightMentor CUQ scores is shown in Figure 4. The mean score was 76.20 ± 11.46 and the median was 76.5. The highest score was 100.0 and the lowest was 48.4.

Figure 4: WeightMentor CUQ scores

4.2 Usability Questionnaire Correlation
Correlations between each pair of usability survey scores, and the corresponding scatter plots, are shown in Figure 5. All correlations are statistically significant, since the p-values are all below 0.05 (5%). Multiple regression was used to determine if the CUQ score can be predicted from the SUS and mean UEQ scores. Results are shown in Table 5.
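For reference, a sketch of the Table 5 model (CUQ regressed on SUS and mean UEQ score), shown here with Python's statsmodels rather than the authors' R Studio workflow; the data are synthetic placeholders shaped to echo the reported coefficients:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
sus = rng.normal(84.8, 12.0, 30)  # placeholder per-participant SUS scores
ueq = rng.normal(1.8, 0.5, 30)    # placeholder mean UEQ scores
cuq = 36.0 + 0.26 * sus + 10.4 * ueq + rng.normal(0.0, 5.3, 30)  # toy response

X = sm.add_constant(np.column_stack([sus, ueq]))  # intercept + two predictors
model = sm.OLS(cuq, X).fit()
print(model.summary())  # coefficients, t-values, p-values, R-squared, F-statistic
```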

4.3 Chatbot Usability Issues
Thirty participants identified fifty-three usability issues. Usability issues were selected based on think-aloud data and feedback from participants in the post-test survey. Usability issues identified per participant were listed in a spreadsheet in the chronological order in which the tests were conducted. All usability issues identified by the first participant were treated as unique, and usability issues identified by subsequent participants were considered unique if they had not been identified by a previous participant. All unique usability issues identified at each test are plotted as a line graph in Figure 6.

To determine the best-case scenario, participants were re-sorted in a spreadsheet in descending order based on the number of usability issues identified per participant. Where more than one participant identified the same number of issues, these were sorted in the chronological order of the tests. Usability issues identified by the first participant were treated as unique, and subsequent issues were counted only if they had not yet been identified by previous participants. Identified usability issues were plotted against the number of participants as a line graph, shown in Figure 6. In this scenario the first participant identifies the most usability issues and the last participant identifies the fewest. This represents the best case (i.e. the smallest number of subjects required to identify almost all of the usability issues).

To determine the worst-case scenario, the participants from the best-case scenario were sorted in reverse order and usability issues were counted as for the previous line graphs. The graph of this scenario is also shown in Figure 6, where the first participant identified the fewest usability issues whilst the last participant identified the most.
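All three orderings reduce to the same single pass over per-participant issue sets, counting an issue only the first time it appears. A minimal sketch with hypothetical issue labels:

```python
# Cumulative count of unique usability issues after each participant.
def discovery_curve(issue_sets: list[set[str]]) -> list[int]:
    seen: set[str] = set()
    curve = []
    for issues in issue_sets:
        seen |= issues  # only issues not raised by an earlier participant add to the count
        curve.append(len(seen))
    return curve

sessions = [{"A", "B"}, {"B", "C"}, {"A"}, {"C", "D"}]  # toy data, chronological order
print(discovery_curve(sessions))  # [2, 3, 3, 4]

# Best case: participants sorted by issues found (descending); worst case: reversed.
best = discovery_curve(sorted(sessions, key=len, reverse=True))
worst = discovery_curve(sorted(sessions, key=len))
```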


Figure 5: Scatter plots of (a) SUS and CUQ scores, (b) SUS and UEQ mean scores, (c) CUQ and UEQ mean scores

Table 5: Multiple regression results, where SUS and UEQ are independent variables and CUQ is the dependent variable

Coefficient  Estimate  Std. Error  t-value  p-value
Intercept    36.0128   8.5217      4.226    0.000243
SUS          0.2578    0.1307      1.973    0.058854
UEQ          10.3694   2.1264      4.876    0.0000425

Residual standard error: 5.253 on 27 degrees of freedom
Multiple R-squared: 0.8046; Adjusted R-squared: 0.7901
F-statistic: 55.58 on 2 and 27 DF; p-value: 0.0000000002681

4.4 Chatbot Task Completion
During the usability tests, participants' interactions with the chatbot were video recorded. Task completion times were calculated for each individual participant, along with the overall mean time. A benchmark task completion time was established by recording the performance of the developer (SH). Two tasks (tasks 2 and 3) were repeated four times by participants to determine if task completion times improved with each repetition. Bar plots of task completion times against the benchmark are shown in Figure 7. The p-values for repeated tasks are shown in Table 6.

5 DISCUSSION
The WeightMentor mean SUS score places the chatbot at the top of the various scales discussed in section 3.4.1 above. This suggests that, in comparison to conventional systems, WeightMentor is highly usable. Similarly, WeightMentor UEQ scores were favourable when compared to the UEQ benchmark. However, SUS and UEQ benchmark scores are derived from tests of conventional systems such as websites, and therefore do not include any chatbot scores. Thus, while WeightMentor may score highly using these metrics, it is impossible to determine the accuracy of this score relative to other chatbots.


Figure 6: Usability issues identified against number of participants

Figure 7: Task completion times against benchmark: (a) per task with repetitions, (b) per task overall

Table 6: Task completion time p-values

Comparison                       p-value
Task 2, Attempt 1 & Attempt 2    0.002621
Task 2, Attempt 2 & Attempt 3    0.3457
Task 2, Attempt 3 & Attempt 4    0.8284
Task 3, Attempt 1 & Attempt 2    0.06736
Task 3, Attempt 2 & Attempt 3    0.9249
Task 3, Attempt 3 & Attempt 4    0.1031

Correlation between the three main questionnaires was generally strong, and was highest between the CUQ and UEQ. Multiple regression results suggest that 80% of the variance in CUQ score can be explained by SUS and UEQ; thus 20% of the variance in CUQ score is explained by other factors, i.e. the CUQ is perhaps measuring constructs more closely related to chatbots that are not being measured by SUS and UEQ. The p-value for SUS is greater than 0.05 (5%), thus SUS is not statistically significant in predicting CUQ scores, while the p-value for UEQ is less than 0.05, hence UEQ is statistically significant in predicting CUQ score. This indicates that UEQ has a greater influence on CUQ score than SUS, which suggests that UEQ and CUQ are measuring similar aspects of chatbot usability and UX. This finding can be rationalized given that UEQ is more complex when compared to SUS. The implication of these findings is that SUS may be less effective for measuring chatbot usability on its own, as it may not measure aspects of UX that make chatbots particularly usable.

Analysis of identified usability issues suggested that in the best-case scenario the optimum number of users is 21 (since the function in Figure 6 plateaus thereafter). In the worst-case scenario the optimum number of users is 30. In chronological order, the optimum number was 26-29 users. The mean number of users required to identify most of the usability issues in the chatbot is 26. Nielsen & Landauer (1993) determined that 80% of unique usability issues may be captured by no more than 5 to 8 users; however, that research concerned conventional systems, not chatbots, and it may be the case that the nature of chatbots makes it more difficult to identify usability issues with a smaller number of participants [19].
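The Nielsen & Landauer figure comes from their problem-discovery model, in which the expected proportion of usability problems found by n users is 1 - (1 - L)^n, where L is the average probability that a single user uncovers a given problem; the average they reported for conventional systems is roughly L = 0.31. The sketch below shows how a much lower per-user discovery rate (the 0.062 value is hypothetical) would push the required sample size toward this study's estimate of around 26 users:

```python
# Expected proportion of usability problems found by n users (Nielsen & Landauer [19]).
def proportion_found(n: int, L: float) -> float:
    return 1 - (1 - L) ** n

for L in (0.31, 0.062):  # conventional-system average vs. a hypothetical low rate
    users = next(n for n in range(1, 101) if proportion_found(n, L) >= 0.8)
    print(f"L = {L}: {users} users to capture 80% of issues")
# L = 0.31 -> 5 users; L = 0.062 -> 26 users
```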

In general, task completion times did improve with each repetition of a task; however, while the time difference was statistically significant between the first and second attempts at task 2, it was not statistically significant between the second and third, or the third and fourth, attempts. Similarly, time differences were not statistically significant between any of the attempts at task 3, which may be because task 3 was very similar in procedure to task 2. This suggests that it may be possible for users to become proficient with a new chatbot very quickly, owing to their simplicity and ease of use.

6 CONCLUSION
Chatbot UX design and usability testing may require nontraditional methods, and multiple metrics are likely to provide a more comprehensive picture of chatbot usability. The WeightMentor chatbot scored highly on both the SUS and UEQ scales, and correlation analyses suggest that correlation is stronger between the CUQ and UEQ than between the CUQ and SUS; validation of the CUQ will increase its effectiveness as a usability analysis tool. Given that the variance of CUQ scores is not completely explained by SUS and UEQ, there is an argument that these traditional usability testing surveys do not evaluate all aspects of a chatbot interface. This study also suggests that approximately 26 subjects are required to identify almost all the usability issues in a chatbot, which challenges previous research. Whilst primitive, this work also suggests that users become optimal after just one attempt at a task when using a chatbot. This could be explained by the fact that chatbots can be less complex, in that they lack the visual hierarchy and complexity seen in normal graphical user interfaces. This study suggests that conventional usability metrics may not be best suited to assessing chatbot usability, and that metrics will be most effective if they measure aspects of usability that are more closely related to chatbots. Chatbot usability testing may also potentially require a greater number of users than suggested by previous research in order to maximize capture of usability issues. Finally, this study suggests that users can potentially reach optimum proficiency with chatbots very quickly.

ACKNOWLEDGMENTS
This research study is part of a PhD research project at Ulster University, funded by the Department for the Economy, Northern Ireland.

REFERENCES
[1] Aaron Bangor, Philip T. Kortum and James T. Miller. 2008. An Empirical Evaluation of the System Usability Scale. International Journal of Human-Computer Interaction 24, 6 (2008), 574-594. https://doi.org/10.1080/10447310802205776
[2] Aaron Bangor, Philip T. Kortum and James T. Miller. 2009. Determining What Individual SUS Scores Mean: Adding an Adjective Rating Scale. Journal of Usability Studies 4, 3 (2009), 114-123.
[3] Azy Barak, Britt Klein and Judith G. Proudfoot. 2009. Defining Internet-Supported Therapeutic Interventions. Annals of Behavioral Medicine 38, 1 (2009), 4-17. https://doi.org/10.1007/s12160-009-9130-7
[4] John Brooke. 1996. SUS: A 'quick and dirty' usability scale. CRC Press, 189-194.
[5] Raluca Budiu. 2018. The User Experience of Chatbots.
[6] Gillian Cameron, David Cameron, Gavin Megaw, Raymond Bond, Maurice Mulvenna, Siobhan O'Neill, Cherie Armour and Michael McTear. 2017. Towards a Chatbot for Digital Counselling. In Proceedings of the 31st British Computer Society Human Computer Interaction Conference (HCI '17). BCS Learning & Development Ltd, Swindon, UK, 241-247. https://doi.org/10.14236/ewic/HCI2017.24
[7] Gillian Cameron, David Cameron, Gavin Megaw, Raymond Bond, Maurice Mulvenna, Siobhan O'Neill, Cherie Armour and Michael McTear. 2018. Back to the Future: Lessons from Knowledge Engineering Methodologies for Chatbot Design and Development. https://doi.org/10.14236/ewic/HCI2018.153
[8] Gillian Cameron, David Cameron, Gavin Megaw, Raymond Bond, Maurice Mulvenna, Siobhan O'Neill, Cherie Armour and Michael McTear. 2018. Best practices for designing chatbots in mental healthcare - A case study on iHelpr. https://doi.org/10.14236/ewic/HCI2018.129
[9] E. L. Donaldson, S. Fallows and M. Morris. 2014. A text message based weight management intervention for overweight adults. Journal of Human Nutrition & Dietetics 27, Suppl 2 (Apr 2014), 90-97.
[10] A. H. Fadil and S. Gabrielli. 2017. Addressing Challenges in Promoting Healthy Lifestyles: The AI-Chatbot Approach.
[11] Brianna S. Fjeldsoe, Ana D. Goode, Philayrath Phongsavan, Adrian Bauman, Genevieve Maher, Elisabeth Winkler and Elizabeth G. Eakin. 2016. Evaluating the Maintenance of Lifestyle Changes in a Randomized Controlled Trial of the 'Get Healthy, Stay Healthy' Program. JMIR mHealth and uHealth 4, 2 (2016), e42.
[12] Aideen Gibson, Claire McCauley, Maurice Mulvenna, Assumpta Ryan, Liz Laird, Kevin Curran, Brendan Bunting, Finola Ferry and Raymond Bond. 2016. Assessing Usability Testing for People Living with Dementia. In Proceedings of the 4th Workshop on ICTs for Improving Patients Rehabilitation Research Techniques (REHAB '16). ACM, New York, NY, USA, 25-31. https://doi.org/10.1145/3051488.3051492
[13] Samuel Holmes, Anne Moorhead, Raymond Bond, Huiru Zheng, Vivien Coates and Michael McTear. 2018. A New Automated Chatbot for Weight Loss Maintenance. In Proceedings of the 32nd International BCS Human Computer Interaction Conference (HCI 2018). https://doi.org/10.14236/ewic/HCI2018.103
[14] A. Baki Kocaballi, L. Laranjo and E. Coiera. 2018. Measuring User Experience in Conversational Interfaces: A Comparison of Six Questionnaires. In Proceedings of the 32nd International BCS Human Computer Interaction Conference (HCI 2018).
[15] Bettina Laugwitz, Theo Held and Martin Schrepp. 2008. Construction and Evaluation of a User Experience Questionnaire. In HCI and Usability for Education and Work: 4th Symposium of the Workgroup Human-Computer Interaction and Usability Engineering of the Austrian Computer Society (USAB), Vol. 5298. 63-76. https://doi.org/10.1007/978-3-540-89350-9_6
[16] J. Martín, C. Muñoz-Romero and N. Ábalos. 2017. chatbottest - Improve your chatbot's design.
[17] Robert J. Moore, Raphael Arar, Guang-Jie Ren and Margaret H. Szymanski. 2017. Conversational UX Design. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA '17). ACM, New York, NY, USA, 492-497. https://doi.org/10.1145/3027063.3027077
[18] J. Nielsen. 1994. 10 Heuristics for User Interface Design.
[19] Jakob Nielsen and Thomas K. Landauer. 1993. A Mathematical Model of the Finding of Usability Problems. In Proceedings of the INTERCHI '93 Conference on Human Factors in Computing Systems (INTERCHI '93). IOS Press, Amsterdam, The Netherlands, 206-213. http://dl.acm.org/citation.cfm?id=164632.164904
[20] Wharton, University of Pennsylvania. 2016. The Rise of the Chatbots: Is It Time to Embrace Them?
[21] Ofcom. 2017. The UK Communications Market: Telecoms & Networks. Technical Report. Ofcom.
[22] Kelly Pedotto, Vivey Chen and J. P. McElyea. 2017. The 2017 US Mobile App Report.
[23] Fred F. Reichheld. 2003. The One Number You Need to Grow.
[24] Jeff Sauro. 2012. Predicting Net Promoter Scores from System Usability Scale Scores.
[25] Jeff Sauro. 2018. 5 Ways to Interpret a SUS Score.
[26] Martin Schrepp. 2019. User Experience Questionnaire Data Analysis Tool.
[27] B. Shneiderman, C. Plaisant, M. Cohen, S. Jacobs and N. Elmqvist. 2016. Designing the User Interface: Strategies for Effective Human-Computer Interaction (sixth ed.). Pearson.
[28] T. Tullis and B. Albert. 2018. Measuring the User Experience: SUS Calculation Tool.
[29] UX247. 2018. Chatbots: Usability Testing & Chatbots.

  • Abstract
  • 1 Introduction
    • 11 Chatbot UX Design
    • 12 Testing Usability of Chatbots
    • 13 Chatbots in Healthcare
    • 14 The WeightMentor Chatbot
      • 2 Research Questions
      • 3 Methods
        • 31 Research Design
        • 32 Chatbot Development
        • 33 Test Protocol
        • 34 Usability Metrics
        • 35 Questionnaire Analysis
        • 36 Data Analysis amp Presentation Tools
          • 4 Results
            • 41 WeightMentor Chatbot Usability
            • 42 Usability Questionnaire Correlation
            • 43 Chatbot Usability Issues
            • 44 Chatbot Task Completion
              • 5 Discussion
              • 6 Conclusion
              • Acknowledgments
              • References
Page 2: Usability testing of a healthcare chatbot: Can we use ... · bers for identifying usability issues may not apply to chatbots. The study design was a usability study. WeightMentor

ECCE rsquo19 September 10ndash13 2019 Belfast UK Holmes Moorhead Bond Zheng Coates amp McTear

interface prototyping chatbot design instead focuses on conversa-tion modelling and the interface may be improved by analysinginteraction logs to determine how best to improve conversationflow [7] Best practices for User Experience (UX) design such asSchneidermanrsquos Eight Golden Rules [27] and Nielsenrsquos Ten Usabil-ity Heuristics [18] cannot be easily applied to chatbots which mustinstead replicate human conversation Moore et al (2017) iden-tify three basic principles of conversation design and four mainconversation types summarized in Tables 1 and 2 [17]

Table 1 Basic conversation design principles (Moore et al2017)

Principle Description

Recipient Design Tailoring conversation to match userrsquoslevel of understanding

Minimization Keeping interactions as short and simpleas possible

Repair Recovering from failures and helping un-derstanding (eg repeatingrephrasing)

Table 2 Main conversation types (Moore et al 2017)

Conversation Type Description

Ordinary Conversation Casual and unrestrained (egsmall talk)

Service Conversation Constrained by rules androles (eg customer serviceagentcustomer)

Teaching Conversation Happens between teacher andstudent Teacher probes to testknowledge

Counselling Conversation Counselee leads the conversa-tion seeking advice from thecounsellor

12 Testing Usability of ChatbotsUsability tests are generally conducted according to standard prac-tices using standardized tools However in some cases it is nec-essary to modify test protocols to account for characteristics ofparticipants or technologies Gibson et al (2016) described a us-ability test of an app designed for aiding reminiscence in peopleliving with dementia In this case it was necessary to favour us-ability testing metrics such as task completion over others such asconcurrent think-aloud and usability questionnaires [12] It may benecessary to similarly modify traditional usability testing methodswhen testing chatbots UX247 describe four main issues that maybe encountered by users of a traditional system such as a websiteThese four issues are Language Branding Functions and Informa-tion Retrieval [29] If language used on a website is too complex auser may struggle to understand Chatbot users may face the same

issue as chatbots are conversation-based Branding in websites andsoftware is generally always visual using recognizable graphicsand colours Chatbots on the other hand are conversation driventhus the conversation and tone of voice need to reflect the brandIf functions of a website or system are poorly designed this willreduce the usability of the site In conversation-based systems func-tions may be considered equal to conversations and conversationflow which if poorly designed will also affect usability Finallyinformation retrieval must be accurate In a web-based system forexample a poorly designed search function may result in incorrectinformation being returned to the user but in chatbot terms thismay happen if the chatbot incorrectly interprets what the user saysor misunderstands their question [29]

A 2018 study by Nielsen-Norman group suggested that severalaspects of chatbots should be tested in order to validate the UX [5]These include interaction style (eg buttons and links vs text entry)conversation flow language and privacy Nevertheless it is evidentthat testing the usability of chatbots might require new methodsbeyond the conventional usability engineering instruments sincechatbots offer a very different kind of human-computer interactionFurthermore given that usability validation of medical devices andhealthcare software is often a requirement for FDA approval mea-suring the usability of a healthcare focused chatbot is an importantresearch topic

13 Chatbots in HealthcareWhilst the research is primitive chatbots have been shown to beof use as therapeutic healthcare interventions or for at least aug-menting traditional healthcare interventions Barak et al (2009)reported that the ability of chatbots to create empathy and reactto emotions resulted in higher compliance with therapeutic treat-ments [3] Healthcare focused chatbots facilitate increased userengagement and increased usability [10] and may also solve issueswith text message-based systems such as 24-hour availability andautomated messages sounding less human [9] In response to grow-ing waiting lists and difficulties in accessing mental health servicesCameron et al (2018) developed iHelpr a self-help mental healthchatbot It is suggested that conversational interfaces may soonreplace web pages and smartphone apps as the preferred means ofconducting online tasks [8] Chatbots are suggested to be of use inthe area of mental health because they provide instant access tohelp and support and increased efficiency [6]

14 The WeightMentor ChatbotIn this paper we describe a study that assessed the usability of ahealthcare focused chatbot calledWeightMentor which we have de-veloped at Ulster University This chatbot is a bespoke self-help mo-tivational tool for weight loss maintenance with the purpose of sup-ports weight loss maintenance by encouraging self-reporting per-sonalized feedback and motivational dialogues [13] Self-reportingpersonalized feedback and motivation have been shown to be ben-eficial for weight loss maintenance in the short term [9] [11] Thispaper involves the usability testing of the WeightMentor chatbotand uses this experiment as a case study to help answer several keyresearch questions in this field

Chatbots Can conventional methods be used to test usability ECCE rsquo19 September 10ndash13 2019 Belfast UK

2 RESEARCH QUESTIONS(1) How usable is the WeightMentor Chatbot according to con-

ventional usability methods(2) To what extend will different conventional usability ques-

tionnaires correlate when evaluating chatbot usability Andhow do they correlate to a tailored chatbot usability surveyscore

(3) What is the optimum number of users required to identifychatbot usability issues

(4) How many task repetitions are required for a first-time chat-bot users to reach optimum task performance (ie efficiencybased on task completion times)

Question 3 is of interest as knowing howmany subjects to recruitto identify most of the usability issues in a chatbot is importantfor planning usability studies in the UX industry Moreover it isalso important given that studies have reported that most usabilityissues of traditional human-computer systems can be identifiedwhen recruiting 5-8 subjects [19] However this may not hold truefor chatbot interfaces

3 METHODS31 Research DesignThis usability test was an observational research design testingthe usability of the chatbot and comparing a novel usability ques-tionnaire specifically designed for chatbot testing with existingquestionnaires Ethical approval for this study was obtained fromthe School of Communication amp Media Filter (Research Ethics)Committee Ulster University in October 2019

32 Chatbot DevelopmentThe WeightMentor chatbot was developed for Facebook Messen-ger as Facebook is currently the most popular social media toolon smartphones and mobile devices [21] [22] The chatbot conver-sation flow was designed using DialogFlow which is a powerfulframework integrating Googlersquos machine learning and natural lan-guage understanding capabilities The basic chatbot functionalitywas supplemented using a custom built NodeJS app hosted onHeroku

33 Test ProtocolFigure 1 shows a photograph of a typical usability testing setupThe MOD1000 camera records the userrsquos hand movements as theyinteract with the chatbot The laptop runs software to capture videofrom the MOD1000 and audio as part of the concurrent think-aloudaspect of the usability test

Usability tests were conducted as follows

(1) Participant signed the consent form(2) Participant was given access to WeightMentor(3) Test coordinator read a briefing to the participant(4) Participant completed pre-test (demographic) questionnaire(5) Test coordinator read each task to the participant(6) Participant confirmed their understanding of the task

Figure 1 Typical usability testing setup

(7) Participant answered pre-task Single Ease Question (On ascale of 1 to 7 how easy do you think it will be to completethis task)

(8) Participant was video and audio recorded completing taskParticipant talked through what they could see on screenand what they were trying to do during task (known as theconcurrent think aloud protocol)

(9) Participant answered post-task Single Ease Question (On ascale of 1 to 7 how easy was it to complete this task)

(10) After all tasks each participant completed Post-Test usabil-ity surveys including System Usability Scale (SUS) survey[4] User Experience Questionnaire (UEQ) [15] and a newChatbot Usability Questionnaire (CUQ) that we developedfor this study (refer to figure 2)

34 Usability MetricsBaki Kocaballi et al (2018) suggested that multiple metrics may bemore appropriate for measuring chatbot usability [14] Thus threemetrics were selected to evaluate the usability ofWeightMentor theSUS scores UEQ metrics and our own CUQ score (although CUQ isyet to be validated but is has been designed to measure the usabilityof chatbots as opposed to measuring the general usability of human-computer systems) In addition wemeasured task completion times

ECCE rsquo19 September 10ndash13 2019 Belfast UK Holmes Moorhead Bond Zheng Coates amp McTear

usability issues and the differential between pre-task SEQ answersand post-task SEQ answers

341 System Usability Scale (SUS) SUS was designed as a quickand easy means of assessing usability [4] and today is one of themost commonly used usability testing tools SUS is comprised often validated statements covering five positive aspects and fivenegative aspects of the system Participants score each questionout of five Final scores are out of 100 and may be compared withthe SUS benchmark (currently 680 representing an average score)or interpreted using several possible scales

SUS scores can be grouped into percentile ranges [25] Alterna-tively a grade system may be used Bangor et al (2009) proposean adjective based scale ranging from Worst Imaginable (SUSgrades 0 - 25) to Best Imaginable (SUS grades over 841) [2] TheAcceptability Scale [1] ranges from Not Acceptable (SUS lt 626)to Acceptable (SUS gt 711) Finally SUS scores may be linked withthe Net Promoter Score (NPS) designed for measuring customerloyalty [23] SUS scores greater than 788 may be classed as Pro-moters below 627 will be Detractors and scores in between arePassive [24]

The benchmark SUS score falls within the 41st - 59th percentileIt is grade C Good on the adjective scale Marginal on the ac-ceptability scale and NPC classification Passive

342 User Experience Questionnaire (UEQ) The UEQ serves as ameans of comprehensively assessing the UX [15] It is based onsix scales summarised in Table 3 Scales are measured using pairsof opposite adjectives to describe the system with participantsselecting their level of agreement with each UEQ scores assess theextent to which the system meets expectations but more usefullymay be compared with a benchmark to determine how the systemunder test compares to other systems

Table 3 UEQ Scales

Scale What does it measure

Attractiveness Extent to which users like the systemPerspicuity Ease of learning and becoming proficient

in the systemEfficiency Effort required to complete tasks System

reaction timesDependability Extent of user control System predictabil-

itysecurityStimulation How funexciting is it to use the systemNovelty CreativenessInterest for users

343 Chatbot Usability Questionnaire (CUQ) The CUQ is basedon the chatbot UX principles provided by the ALMA Chatbot Testtool [16] which assess personality onboarding navigation under-standing responses error handling and intelligence of a chatbotThe CUQ is designed to be comparable to SUS except it is bespokefor chatbots and includes 16 items Participantsrsquo levels of agreementwith sixteen statements relating to positive and negative aspectsof the chatbot are ranked out of five from Strongly Disagree toStrongly Agree Statements used in the CUQ are listed in Table 4

Table 4 Statements used in the novel and bespoke but rsquoun-validatedrsquo Chatbot Usability Questionnaire (CUQ)

QuestionNumber

Question

1 The chatbotrsquos personality was realistic and en-gaging

2 The chatbot seemed too robotic3 The chatbot was welcoming during initial setup4 The chatbot seemed very unfriendly5 The chatbot explained its scope and purpose well6 The chatbot gave no indication as to its purpose7 The chatbot was easy to navigate8 It would be easy to get confused when using the

chatbot9 The chatbot understood me well10 The chatbot failed to recognise a lot of my inputs11 Chatbot responses were useful appropriate and

informative12 Chatbot responses were not relevant13 The chatbot coped well with any errors or mis-

takes14 The chatbot seemed unable to handle any errors15 The chatbot was very easy to use16 The chatbot was very complex

35 Questionnaire Analysis351 SUS Calculation The SUS score calculation spreadsheet [28]was used to calculate SUS scores out of 100 The formula for thiscalculation is shown in equation 1

SUS =1n

nsumi=1

normmsumj=1

qi jminus1qi jmod2gt0

5minusqi j otherwise

(1)

where n=number of subjects (questionnaires) m=10 (numberof questions) qij=individual score per question per participantnorm=25

352 UEQ Calculation The UEQ Data Analysis Tool [26] analysesquestionnaire data and presents results graphically By default theUEQ does not generate a single score for each participant but insteadprovides six scores one for each attribute To facilitate correlationanalysis using the UEQ a acircĂIJmean UEQ scoreacircĂİ from the sixscores was calculated for each participant The mean UEQ score isthe mean of the scores for all six scales of the UEQ per participant

353 CUQ Calculation CUQ scores were calculated out of 160using the formula in equation 2 and then normalized to give a scoreout of 100 to permit comparison with SUS

CUQ =

(( msumn=1

2n minus 1

)minus 5

)+

(25 minus

( msumn=1

2n

))times 16 (2)

where m = 16 (number of questions) and n = individual questionscore per participant

Chatbots Can conventional methods be used to test usability ECCE rsquo19 September 10ndash13 2019 Belfast UK

36 Data Analysis amp Presentation ToolsR Studio and Microsoft Excel were used for data analysis Box plotswere produced using R Studio and barline plots were producedusing excel Statistical tests such as correlation tests and t-testswere conducted in R Studio Correlation analysis was conductedusing Pearson correlation coefficient Mean task completion timeswere compared for significance using paired Wilcoxon analysisThe p-value threshold for statistical significance was 5 (or 005)

4 RESULTS41 WeightMentor Chatbot Usability411 System Usability Scale A total of 30 participants (healthyadults) were recruited andwho evaluated the usability of theWeight-Mentor chatbot A boxplot ofWeightMentor SUS scores is shownin Figure 2 The meanWeightMentor SUS score was 8483 plusmn 1203and the median was 8625 The highest score was 1000 and thelowest was 5750 The meanWeightMentor score places the chatbotin the 96th-100th percentile range equivalent to Grade A+ BestImaginable Acceptable and NPS Class Promoter on the variousscales discussed in methods Hence the chatbot has a high degreeof usability according to traditional usability SUS scores HoweverSUS distributions for benchmarking have not included usabilityscores from testing chatbots

Figure 2: WeightMentor SUS scores (benchmark of 68.0 is marked)

4.1.2 User Experience Questionnaire. The chatbot scored highly on all UEQ scales. Scores were well above the benchmark and are presented graphically in Figure 3. Participant scores for each scale were all above +0.8, suggesting that, in general, participants were satisfied with the WeightMentor user experience.

Figure 3: WeightMentor UEQ scores against benchmark

4.1.3 Chatbot Usability Questionnaire. A box plot of WeightMentor CUQ scores is shown in Figure 4. The mean score was 76.20 ± 11.46 and the median was 76.5. The highest score was 100.0 and the lowest was 48.4.

Figure 4: WeightMentor CUQ scores

4.2 Usability Questionnaire Correlation

Correlations between each pair of usability survey scores are shown, with scatter plots, in Figure 5. All correlations are statistically significant, since the p-values are all below 0.05 (5%). Multiple regression was used to determine whether CUQ score can be predicted from SUS and mean UEQ scores; the results are shown in Table 5.
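The regression reported in Table 5 amounts to a single linear model in R; this sketch assumes a data frame scores with illustrative column names.

    # CUQ as the dependent variable; SUS and mean UEQ as predictors.
    summary(lm(cuq ~ sus + ueq_mean, data = scores))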

4.3 Chatbot Usability Issues

Thirty participants identified fifty-three usability issues. Usability issues were selected based on think-aloud data and feedback from participants in the post-test survey. The usability issues identified per participant were listed in a spreadsheet in the chronological order in which the tests were conducted. All usability issues identified by the first participant were treated as unique, and usability issues identified by subsequent participants were considered unique if they had not been identified by a previous participant. The unique usability issues identified at each test are plotted as a line graph in Figure 6.

To determine the best-case scenario, participants were re-sorted in a spreadsheet in descending order of the number of usability issues identified per participant. Where more than one participant identified the same number of issues, these were sorted in chronological order. Usability issues identified by the first participant were treated as unique, and subsequent issues were counted only if they had not yet been identified by previous participants. Identified usability issues were plotted against the number of participants as a line graph, shown in Figure 6. In this scenario the first participant identifies the most usability issues and the last participant identifies the fewest. This represents the best case (i.e. the smallest number of subjects required to identify almost all of the usability issues).

To determine the worst-case scenario, the participants from the best-case scenario were sorted in reverse order and unique usability issues were counted as for the previous line graphs. The graph of this scenario is also shown in Figure 6; here the first participant identified the fewest usability issues whilst the last participant identified the most.
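The counting rule used for all three orderings reduces to a cumulative union of issue identifiers. A minimal R sketch, assuming the spreadsheet is represented as a list of per-participant issue-ID vectors:

    # Running total of unique issues as participants are added in order.
    cumulative_unique <- function(issues) {
      seen <- character(0)
      sapply(issues, function(ids) {
        seen <<- union(seen, ids)   # pool this participant's issues
        length(seen)                # unique issues found so far
      })
    }
    # plot(cumulative_unique(issues), type = "l")  # one curve of Figure 6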


Figure 5: Scatter plots of (a) SUS and CUQ scores, (b) SUS and UEQ mean scores, (c) CUQ and UEQ mean scores

Table 5: Multiple regression results where SUS and UEQ are independent variables and CUQ is the dependent variable

Coefficient   Estimate   Std. Error   t-value   p-value
Intercept     36.0128    8.5217       4.226     0.000243
SUS           0.2578     0.1307       1.973     0.058854
UEQ           10.3694    2.1264       4.876     0.0000425

Residual standard error: 5.253 on 27 degrees of freedom
Multiple R-squared: 0.8046; adjusted R-squared: 0.7901
F-statistic: 55.58 on 2 and 27 DF; p-value: 0.0000000002681

4.4 Chatbot Task Completion

During the usability tests, participants' interactions with the chatbot were video recorded. Task completion times were calculated for each individual participant, along with the overall mean time. A benchmark task completion time was established by recording the performance of the developer (SH). Two tasks (Task 2 and Task 3) were repeated four times by participants to determine whether task completion times improved with each repetition. Bar plots of task completion times against the benchmark are shown in Figure 7. The p-values for the repeated tasks are shown in Table 6.

5 DISCUSSION

The WeightMentor mean SUS score places the chatbot at the top of the various scales discussed in Section 3.4.1 above. This suggests that, in comparison with conventional systems, WeightMentor is highly usable. Similarly, WeightMentor UEQ scores were favourable when compared with the UEQ benchmark. However, SUS and UEQ benchmark scores are derived from tests of conventional systems such as websites and therefore do not include any chatbot scores. Thus, while WeightMentor may score highly on these metrics, it is impossible to determine the accuracy of this score relative to other chatbots.


Figure 6: Usability issues identified against number of participants

Figure 7: Task completion times against benchmark: (a) per task with repetitions, (b) per task overall

Table 6: Task completion time p-values

Comparison                         p-value
Task 2: Attempt 1 vs. Attempt 2    0.002621
Task 2: Attempt 2 vs. Attempt 3    0.3457
Task 2: Attempt 3 vs. Attempt 4    0.8284
Task 3: Attempt 1 vs. Attempt 2    0.06736
Task 3: Attempt 2 vs. Attempt 3    0.9249
Task 3: Attempt 3 vs. Attempt 4    0.1031

Correlation between the three main questionnaires was generally strong, and was highest between the CUQ and UEQ. The multiple regression results suggest that 80% of the variance in CUQ score can be explained by SUS and UEQ. Thus 20% of the variance in CUQ score is explained by other factors, i.e. the CUQ is perhaps measuring constructs more closely related to chatbots that are not being measured by SUS and UEQ. The p-value for SUS is greater than 0.05 (5%), so SUS is not statistically significant in predicting CUQ scores, whereas the p-value for UEQ is less than 0.05, so UEQ is statistically significant in predicting CUQ score. This indicates that UEQ has a greater influence on CUQ score than SUS, which suggests that UEQ and CUQ are measuring similar aspects of chatbot usability and UX. This finding can be rationalised given that the UEQ is more complex than the SUS. The implication of these findings is that SUS on its own may be less effective for measuring chatbot usability, as it may not measure the aspects of UX that make chatbots particularly usable.

Analysis of the identified usability issues suggested that, in the best-case scenario, the optimum number of users is 21 (since the function in Figure 6 plateaus thereafter). In the worst-case scenario, the optimum number of users is 30. In chronological order, the optimum number was 26-29 users. The mean number of users required to identify most of the usability issues in the chatbot is therefore 26. Nielsen and Landauer (1993) determined that 80% of unique usability issues may be captured by no more than 5 to 8 users; however, that research concerned conventional systems, not chatbots, and it may be the case that the nature of chatbots makes it more difficult to identify usability issues with a smaller number of participants [19].

In general, task completion times did improve with each repetition of a task. However, while the time difference was statistically significant between the first and second attempts at Task 2, it was not statistically significant between the second and third, or third and fourth, attempts. Similarly, time differences were not statistically significant between any of the attempts at Task 3, which may be because Task 3 was very similar in procedure to Task 2. This suggests that it may be possible for users to become proficient with a new chatbot very quickly, owing to chatbots' simplicity and ease of use.

6 CONCLUSION

Chatbot UX design and usability testing may require non-traditional methods, and multiple metrics are likely to provide a more comprehensive picture of chatbot usability. The WeightMentor chatbot scored highly on both the SUS and UEQ scales, and correlation analyses suggest that correlation is stronger between CUQ and UEQ than between CUQ and SUS; validation of the CUQ will increase its effectiveness as a usability analysis tool. Given that the variance of CUQ scores is not completely explained by SUS and UEQ, there is an argument that these traditional usability testing surveys do not evaluate all aspects of a chatbot interface. This study also suggests that approximately 26 subjects are required to identify almost all of the usability issues in a chatbot, which challenges previous research. Whilst primitive, this work also suggests that users become optimal after just one attempt at a task when using a chatbot. This could be explained by the fact that chatbots can be less complex, in that they lack the visual hierarchy and complexity seen in normal graphical user interfaces. This study suggests that conventional usability metrics may not be best suited to assessing chatbot usability, and that metrics will be most effective if they measure aspects of usability that are more closely related to chatbots. Chatbot usability testing may also require a greater number of users than suggested by previous research in order to maximise capture of usability issues. Finally, this study suggests that users can potentially reach optimum proficiency with chatbots very quickly.

ACKNOWLEDGMENTS

This research study is part of a PhD research project at Ulster University, funded by the Department for the Economy, Northern Ireland.

REFERENCES

[1] Aaron Bangor, Philip T. Kortum and James T. Miller. 2008. An Empirical Evaluation of the System Usability Scale. International Journal of Human-Computer Interaction 24, 6 (2008), 574-594. https://doi.org/10.1080/10447310802205776
[2] Aaron Bangor, Philip T. Kortum and James T. Miller. 2009. Determining What Individual SUS Scores Mean: Adding an Adjective Rating Scale. Journal of Usability Studies 4, 3 (2009), 114-123.
[3] Azy Barak, Britt Klein and Judith G. Proudfoot. 2009. Defining Internet-Supported Therapeutic Interventions. Annals of Behavioral Medicine 38, 1 (2009), 4-17. https://doi.org/10.1007/s12160-009-9130-7
[4] John Brooke. 1996. SUS: A 'quick and dirty' usability scale. CRC Press, 189-194.
[5] Raluca Budiu. 2018. The User Experience of Chatbots.
[6] Gillian Cameron, David Cameron, Gavin Megaw, Raymond Bond, Maurice Mulvenna, Siobhan O'Neill, Cherie Armour and Michael McTear. 2017. Towards a Chatbot for Digital Counselling. In Proceedings of the 31st British Computer Society Human Computer Interaction Conference (HCI '17). BCS Learning & Development Ltd, Swindon, UK, 241-247. https://doi.org/10.14236/ewic/HCI2017.24
[7] Gillian Cameron, David Cameron, Gavin Megaw, R.R. Bond, Maurice Mulvenna, Siobhan O'Neill, C. Armour and Michael McTear. 2018. Back to the Future: Lessons from Knowledge Engineering Methodologies for Chatbot Design and Development. https://doi.org/10.14236/ewic/HCI2018.153
[8] Gillian Cameron, David Cameron, Gavin Megaw, R.R. Bond, Maurice Mulvenna, Siobhan O'Neill, C. Armour and Michael McTear. 2018. Best practices for designing chatbots in mental healthcare - A case study on iHelpr. https://doi.org/10.14236/ewic/HCI2018.129
[9] E.L. Donaldson, S. Fallows and M. Morris. 2014. A text message based weight management intervention for overweight adults. Journal of Human Nutrition & Dietetics 27, Suppl 2 (Apr 2014), 90-97. http://ovidsp.ovid.com/athens/ovidweb.cgi?T=JS&CSC=Y&NEWS=N&PAGE=fulltext&D=medl&AN=23738786
[10] A.H. Fadil and S. Gabrielli. 2017. Addressing Challenges in Promoting Healthy Lifestyles: The AI-Chatbot Approach.
[11] Brianna S. Fjeldsoe, Ana D. Goode, Philayrath Phongsavan, Adrian Bauman, Genevieve Maher, Elisabeth Winkler and Elizabeth G. Eakin. 2016. Evaluating the Maintenance of Lifestyle Changes in a Randomized Controlled Trial of the 'Get Healthy, Stay Healthy' Program. JMIR MHealth and UHealth 4, 2 (2016), e42. http://ovidsp.ovid.com/athens/ovidweb.cgi?T=JS&CSC=Y&NEWS=N&PAGE=fulltext&D=prem&AN=27166643
[12] Aideen Gibson, Claire McCauley, Maurice Mulvenna, Assumpta Ryan, Liz Laird, Kevin Curran, Brendan Bunting, Finola Ferry and Raymond Bond. 2016. Assessing Usability Testing for People Living with Dementia. In Proceedings of the 4th Workshop on ICTs for Improving Patients Rehabilitation Research Techniques (REHAB '16). ACM, New York, NY, USA, 25-31. https://doi.org/10.1145/3051488.3051492
[13] Samuel Holmes, Anne Moorhead, R.R. Bond, Huiru Zheng, Vivien Coates and Michael McTear. 2018. A New Automated Chatbot for Weight Loss Maintenance. In Proceedings of the 32nd International BCS Human Computer Interaction Conference (HCI 2018). https://doi.org/10.14236/ewic/HCI2018.103
[14] A. Baki Kocaballi, L. Laranjo and E. Coiera. 2018. Measuring User Experience in Conversational Interfaces: A Comparison of Six Questionnaires. In Proceedings of the 32nd International BCS Human Computer Interaction Conference (HCI 2018).
[15] Bettina Laugwitz, Theo Held and Martin Schrepp. 2008. Construction and Evaluation of a User Experience Questionnaire. In HCI and Usability for Education and Work: 4th Symposium of the Workgroup Human-Computer Interaction and Usability Engineering of the Austrian Computer Society (USAB), Vol. 5298, 63-76. https://doi.org/10.1007/978-3-540-89350-9_6
[16] J. Martín, C. Muñoz-Romero and N. Ábalos. 2017. chatbottest - Improve your chatbot's design.
[17] Robert J. Moore, Raphael Arar, Guang-Jie Ren and Margaret H. Szymanski. 2017. Conversational UX Design. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA '17). ACM, New York, NY, USA, 492-497. https://doi.org/10.1145/3027063.3027077
[18] J. Nielsen. 1994. 10 Heuristics for User Interface design.
[19] Jakob Nielsen and Thomas K. Landauer. 1993. A Mathematical Model of the Finding of Usability Problems. In Proceedings of the INTERCHI '93 Conference on Human Factors in Computing Systems (INTERCHI '93). IOS Press, Amsterdam, The Netherlands, 206-213. http://dl.acm.org/citation.cfm?id=164632.164904
[20] Wharton, University of Pennsylvania. 2016. The Rise of the Chatbots: Is It Time to Embrace Them?
[21] Ofcom. 2017. The UK Communications Market: Telecoms & Networks. Technical Report. Ofcom.
[22] Kelly Pedotto, Vivey Chen and J.P. McElyea. 2017. The 2017 US Mobile App Report.
[23] Fred F. Reichheld. 2003. The One Number You Need to Grow.
[24] Jeff Sauro. 2012. Predicting Net Promoter scores from System Usability Scale Scores.
[25] Jeff Sauro. 2018. 5 ways to interpret a SUS score.
[26] Martin Schrepp. 2019. User Experience Questionnaire Data Analysis Tool.
[27] B. Shneiderman, C. Plaisant, M. Cohen, S. Jacobs and N. Elmqvist. 2016. Designing the User Interface: Strategies for Effective Human-Computer Interaction (sixth edition). Pearson.
[28] T. Tullis and B. Albert. 2018. Measuring the User Experience: SUS Calculation Tool.
[29] UX247. 2018. Chatbots: Usability Testing & Chatbots.

  • Abstract
  • 1 Introduction
    • 11 Chatbot UX Design
    • 12 Testing Usability of Chatbots
    • 13 Chatbots in Healthcare
    • 14 The WeightMentor Chatbot
      • 2 Research Questions
      • 3 Methods
        • 31 Research Design
        • 32 Chatbot Development
        • 33 Test Protocol
        • 34 Usability Metrics
        • 35 Questionnaire Analysis
        • 36 Data Analysis amp Presentation Tools
          • 4 Results
            • 41 WeightMentor Chatbot Usability
            • 42 Usability Questionnaire Correlation
            • 43 Chatbot Usability Issues
            • 44 Chatbot Task Completion
              • 5 Discussion
              • 6 Conclusion
              • Acknowledgments
              • References
Page 3: Usability testing of a healthcare chatbot: Can we use ... · bers for identifying usability issues may not apply to chatbots. The study design was a usability study. WeightMentor

Chatbots Can conventional methods be used to test usability ECCE rsquo19 September 10ndash13 2019 Belfast UK

2 RESEARCH QUESTIONS(1) How usable is the WeightMentor Chatbot according to con-

ventional usability methods(2) To what extend will different conventional usability ques-

tionnaires correlate when evaluating chatbot usability Andhow do they correlate to a tailored chatbot usability surveyscore

(3) What is the optimum number of users required to identifychatbot usability issues

(4) How many task repetitions are required for a first-time chat-bot users to reach optimum task performance (ie efficiencybased on task completion times)

Question 3 is of interest as knowing howmany subjects to recruitto identify most of the usability issues in a chatbot is importantfor planning usability studies in the UX industry Moreover it isalso important given that studies have reported that most usabilityissues of traditional human-computer systems can be identifiedwhen recruiting 5-8 subjects [19] However this may not hold truefor chatbot interfaces

3 METHODS31 Research DesignThis usability test was an observational research design testingthe usability of the chatbot and comparing a novel usability ques-tionnaire specifically designed for chatbot testing with existingquestionnaires Ethical approval for this study was obtained fromthe School of Communication amp Media Filter (Research Ethics)Committee Ulster University in October 2019

32 Chatbot DevelopmentThe WeightMentor chatbot was developed for Facebook Messen-ger as Facebook is currently the most popular social media toolon smartphones and mobile devices [21] [22] The chatbot conver-sation flow was designed using DialogFlow which is a powerfulframework integrating Googlersquos machine learning and natural lan-guage understanding capabilities The basic chatbot functionalitywas supplemented using a custom built NodeJS app hosted onHeroku

33 Test ProtocolFigure 1 shows a photograph of a typical usability testing setupThe MOD1000 camera records the userrsquos hand movements as theyinteract with the chatbot The laptop runs software to capture videofrom the MOD1000 and audio as part of the concurrent think-aloudaspect of the usability test

Usability tests were conducted as follows

(1) Participant signed the consent form(2) Participant was given access to WeightMentor(3) Test coordinator read a briefing to the participant(4) Participant completed pre-test (demographic) questionnaire(5) Test coordinator read each task to the participant(6) Participant confirmed their understanding of the task

Figure 1 Typical usability testing setup

(7) Participant answered pre-task Single Ease Question (On ascale of 1 to 7 how easy do you think it will be to completethis task)

(8) Participant was video and audio recorded completing taskParticipant talked through what they could see on screenand what they were trying to do during task (known as theconcurrent think aloud protocol)

(9) Participant answered post-task Single Ease Question (On ascale of 1 to 7 how easy was it to complete this task)

(10) After all tasks each participant completed Post-Test usabil-ity surveys including System Usability Scale (SUS) survey[4] User Experience Questionnaire (UEQ) [15] and a newChatbot Usability Questionnaire (CUQ) that we developedfor this study (refer to figure 2)

34 Usability MetricsBaki Kocaballi et al (2018) suggested that multiple metrics may bemore appropriate for measuring chatbot usability [14] Thus threemetrics were selected to evaluate the usability ofWeightMentor theSUS scores UEQ metrics and our own CUQ score (although CUQ isyet to be validated but is has been designed to measure the usabilityof chatbots as opposed to measuring the general usability of human-computer systems) In addition wemeasured task completion times

ECCE rsquo19 September 10ndash13 2019 Belfast UK Holmes Moorhead Bond Zheng Coates amp McTear

usability issues and the differential between pre-task SEQ answersand post-task SEQ answers

341 System Usability Scale (SUS) SUS was designed as a quickand easy means of assessing usability [4] and today is one of themost commonly used usability testing tools SUS is comprised often validated statements covering five positive aspects and fivenegative aspects of the system Participants score each questionout of five Final scores are out of 100 and may be compared withthe SUS benchmark (currently 680 representing an average score)or interpreted using several possible scales

SUS scores can be grouped into percentile ranges [25] Alterna-tively a grade system may be used Bangor et al (2009) proposean adjective based scale ranging from Worst Imaginable (SUSgrades 0 - 25) to Best Imaginable (SUS grades over 841) [2] TheAcceptability Scale [1] ranges from Not Acceptable (SUS lt 626)to Acceptable (SUS gt 711) Finally SUS scores may be linked withthe Net Promoter Score (NPS) designed for measuring customerloyalty [23] SUS scores greater than 788 may be classed as Pro-moters below 627 will be Detractors and scores in between arePassive [24]

The benchmark SUS score falls within the 41st - 59th percentileIt is grade C Good on the adjective scale Marginal on the ac-ceptability scale and NPC classification Passive

342 User Experience Questionnaire (UEQ) The UEQ serves as ameans of comprehensively assessing the UX [15] It is based onsix scales summarised in Table 3 Scales are measured using pairsof opposite adjectives to describe the system with participantsselecting their level of agreement with each UEQ scores assess theextent to which the system meets expectations but more usefullymay be compared with a benchmark to determine how the systemunder test compares to other systems

Table 3 UEQ Scales

Scale What does it measure

Attractiveness Extent to which users like the systemPerspicuity Ease of learning and becoming proficient

in the systemEfficiency Effort required to complete tasks System

reaction timesDependability Extent of user control System predictabil-

itysecurityStimulation How funexciting is it to use the systemNovelty CreativenessInterest for users

343 Chatbot Usability Questionnaire (CUQ) The CUQ is basedon the chatbot UX principles provided by the ALMA Chatbot Testtool [16] which assess personality onboarding navigation under-standing responses error handling and intelligence of a chatbotThe CUQ is designed to be comparable to SUS except it is bespokefor chatbots and includes 16 items Participantsrsquo levels of agreementwith sixteen statements relating to positive and negative aspectsof the chatbot are ranked out of five from Strongly Disagree toStrongly Agree Statements used in the CUQ are listed in Table 4

Table 4 Statements used in the novel and bespoke but rsquoun-validatedrsquo Chatbot Usability Questionnaire (CUQ)

QuestionNumber

Question

1 The chatbotrsquos personality was realistic and en-gaging

2 The chatbot seemed too robotic3 The chatbot was welcoming during initial setup4 The chatbot seemed very unfriendly5 The chatbot explained its scope and purpose well6 The chatbot gave no indication as to its purpose7 The chatbot was easy to navigate8 It would be easy to get confused when using the

chatbot9 The chatbot understood me well10 The chatbot failed to recognise a lot of my inputs11 Chatbot responses were useful appropriate and

informative12 Chatbot responses were not relevant13 The chatbot coped well with any errors or mis-

takes14 The chatbot seemed unable to handle any errors15 The chatbot was very easy to use16 The chatbot was very complex

35 Questionnaire Analysis351 SUS Calculation The SUS score calculation spreadsheet [28]was used to calculate SUS scores out of 100 The formula for thiscalculation is shown in equation 1

SUS =1n

nsumi=1

normmsumj=1

qi jminus1qi jmod2gt0

5minusqi j otherwise

(1)

where n=number of subjects (questionnaires) m=10 (numberof questions) qij=individual score per question per participantnorm=25

352 UEQ Calculation The UEQ Data Analysis Tool [26] analysesquestionnaire data and presents results graphically By default theUEQ does not generate a single score for each participant but insteadprovides six scores one for each attribute To facilitate correlationanalysis using the UEQ a acircĂIJmean UEQ scoreacircĂİ from the sixscores was calculated for each participant The mean UEQ score isthe mean of the scores for all six scales of the UEQ per participant

353 CUQ Calculation CUQ scores were calculated out of 160using the formula in equation 2 and then normalized to give a scoreout of 100 to permit comparison with SUS

CUQ =

(( msumn=1

2n minus 1

)minus 5

)+

(25 minus

( msumn=1

2n

))times 16 (2)

where m = 16 (number of questions) and n = individual questionscore per participant

Chatbots Can conventional methods be used to test usability ECCE rsquo19 September 10ndash13 2019 Belfast UK

36 Data Analysis amp Presentation ToolsR Studio and Microsoft Excel were used for data analysis Box plotswere produced using R Studio and barline plots were producedusing excel Statistical tests such as correlation tests and t-testswere conducted in R Studio Correlation analysis was conductedusing Pearson correlation coefficient Mean task completion timeswere compared for significance using paired Wilcoxon analysisThe p-value threshold for statistical significance was 5 (or 005)

4 RESULTS41 WeightMentor Chatbot Usability411 System Usability Scale A total of 30 participants (healthyadults) were recruited andwho evaluated the usability of theWeight-Mentor chatbot A boxplot ofWeightMentor SUS scores is shownin Figure 2 The meanWeightMentor SUS score was 8483 plusmn 1203and the median was 8625 The highest score was 1000 and thelowest was 5750 The meanWeightMentor score places the chatbotin the 96th-100th percentile range equivalent to Grade A+ BestImaginable Acceptable and NPS Class Promoter on the variousscales discussed in methods Hence the chatbot has a high degreeof usability according to traditional usability SUS scores HoweverSUS distributions for benchmarking have not included usabilityscores from testing chatbots

Figure 2 WeightMentor SUS scores (Benchmark of 680 ismarked)

412 User Experience Questionnaire The chatbot scored highly inall UEQ scales Scores werewell above benchmark and are presentedgraphically in Figure 3 Participant scores for each scale were allabove +08 suggesting that in general participants were satisfiedwith theWeightMentor user experience

Figure 3WeightMentor UEQ scores against benchmark

413 Chatbot UsabilityQuestionnaire A box plot ofWeightMentorCUQ scores is shown in Figure 4 The mean score was 7620plusmn 1146and the median was 765 The highest score was 1000 and lowestwas 484

Figure 4WeightMentor CUQ scores

42 Usability Questionnaire CorrelationCorrelations between each usability survey score and scatter plotsare shown in Figure 5 All correlations are statistically significantsince the p-values are all below 005 (5) Multiple Regression wasused to determine if CUQ score can be determined by SUS andMean UEQ score Results are shown in Table 5

43 Chatbot Usability IssuesThirty participants identified fifty-three usability issues Usabilityissues were selected based on think-aloud data and feedback fromparticipants in the Post-Test Survey Usability issues identified perparticipant were listed in a spreadsheet in the chronological orderin which the tests were conducted All usability issues identifiedby the first participant were treated as unique and usability issuesidentified by subsequent participants were considered unique ifthey had not been identified by a previous participant All uniqueusability issues identified at each test are plotted as a line graph inFigure 6

To determine the best-case scenario participants were re-sortedin a spreadsheet in descending order based on the number of us-ability issues identified per participant Where more than one par-ticipant identified the same number of issues these were sorted inchronological order of participants Usability issues identified bythe first participant were treated as unique and subsequent issueswere counted only if they had not yet been identified by previousparticipants Identified usability issues were plotted against numberof participants as a line graph shown in Figure 6 In this scenariothe first participant identifies the most usability issues and the lastparticipant identifies the least This represents the best case (iethe least number of subjects required to identify almost all of theusability issues)

To determine the worst-case scenario participants for best casescenario were sorted in reverse order and usability issues wereidentified as for previous line graphs The graph of this scenariois also shown in Figure 6 where the first participant identified thefewest usability issues whilst the last participant identified themost usability issues

ECCE rsquo19 September 10ndash13 2019 Belfast UK Holmes Moorhead Bond Zheng Coates amp McTear

Figure 5 Scatter plots of (a) SUS and CUQ scores (b) SUS and UEQ mean scores (c) CUQ and UEQ mean scores

Table 5 Multiple regression results where SUS and UEQ are independent variables and CUQ is the dependent variable

Coefficients Estimate Std Error t-value p-value

Intercept 360128 85217 4226 0000243SUS 02578 01307 1973 0058854UEQ 103694 21264 4876 00000425Residual Standard Error 5253 on 27 degrees of freedomMultiple R-Squared 08046 Adjusted R-Squared 07901F-statistic 5558 on 2 and 27 DF p-value 00000000002681

44 Chatbot Task CompletionDuring the usability tests participantsacircĂŹ interactions with thechatbot were video recorded Task completion times were calculatedfor each individual participant along with the mean time overallA benchmark task completion time was established by recordingthe performance of the developer (SH) Two tasks (task 2 and task3) were repeated four times by participants to determine if taskcompletion times improved with each repetition Bar plots of taskcompletion times against the benchmark are shown in Figure 7The p-values for repeated tasks are shown in Table 6

5 DISCUSSIONWeightMentor mean SUS score places the chatbot at the top of thevarious scales discussed in section 331 above This suggests thatin comparison to conventional systems WeightMentor is highlyusable Similarly WeightMentor UEQ scores were favourable whencompared to UEQ benchmark However SUS and UEQ benchmarkscores are derived from tests of conventional systems such as web-sites therefore do not include any chatbot scores Thus whileWeightMentor may score highly using these metrics it is impossi-ble to determine the accuracy of this score relative to other chatbots

Chatbots Can conventional methods be used to test usability ECCE rsquo19 September 10ndash13 2019 Belfast UK

Figure 6 Usability issues identified against number of par-ticipants

Figure 7 Task completion times against benchmark (a) pertask with repetitions (b) per task overall

Correlation between the three main questionnaires was generally

Table 6 Task completion time p-values

Comparison p-value

Task 2 Attempt 1 amp Attempt 2 0002621Attempt 2 amp Attempt 3 03457Attempt 3 amp Attempt 4 08284

Task 3 Attempt 1 amp Attempt 2 006736Attempt 2 amp Attempt 3 09249Attempt 3 amp Attempt 4 01031

strong and was highest between the CUQ and UEQ Multiple Re-gression results suggest that 80 of the variance in CUQ score canbe explained by SUS and UEQ Thus 20 of the variance in CUQscore is explained by other factors ie CUQ is perhaps measuringconstructs more closely related to chatbots that is not being mea-sured by SUS and UEQ The p-value for SUS is greater than 005(5) thus it is not statistically significant in predicting CUQ scoresand the p-value for UEQ is less than 005 hence this is statisticallysignificant in predicting CUQ score This indicates that UEQ hasa greater influence on CUQ score than SUS which suggests thatUEQ and CUQ are measuring similar aspects of chatbot usabilityand UX This finding can be rationalized given that UEQ is morecomplex when compared to SUS The implication of these findingsis that SUS may be less effective for measuring chatbot usability onits own as it may not measure aspects of UX that make chatbotsparticularly usable

Analysis of identified usability issues suggested that in the best-case scenario the optimum number of users is 21 (since the functionin figure 6 plateaus thereafter) In the worst-case scenario theoptimum number of users is 30 In chronological order the optimumnumber was 26 - 29 users The mean number of users required toidentify most of the usability issues in the chatbot is 26 Nielsen ampLandauer (1993) determined that 80 of unique usability issues maybe captured by no more than 5 to 8 users however this researchconcerned conventional systems not chatbots and it may be thecase that the nature of chatbots makes it more difficult to identifyusability issues with a smaller number of participants [19]

In general task completion times did improve with each repeti-tion of a task however while the time difference was statisticallysignificant between the first attempt and the second attempt at task2 it was not statistically significant between the second and thirdand third and fourth attempts Similarly time differences were notstatistically significant between any of the attempts at task 3 whichmay be because task 3 was very similar in procedure to task 2 Thissuggests that it may be possible for users to become proficient witha new chatbot very quickly owing to their simplicity and ease ofuse

6 CONCLUSIONChatbot UX design and usability testing may require nontraditionalmethods and multiple metrics are likely to provide a more com-prehensive picture of chatbot usability The WeightMentor chatbotscored highly on both SUS and UEQ scales and correlation anal-yses suggest that correlation is stronger between CUQ and UEQthan CUQ and SUS and validation of the CUQ will increase its

ECCE rsquo19 September 10ndash13 2019 Belfast UK Holmes Moorhead Bond Zheng Coates amp McTear

effectiveness as a usability analysis tool Given that the variance ofCUQ scores are not completely explained by SUS and UEQ there isan argument that these traditional usability testing surveys do notevaluate all aspects of a chatbot interface This study also suggeststhat approximately 26 subjects are required to identify almost allthe usability issues in the chatbot which challenges previous re-search Whilst primitive this work also suggests that users becomeoptimal after just one attempt of a task when using a chatbot Thiscould be explained by the fact that chatbots can be less complex inthat they lack visual hierarchy and complexity as seen in normallygraphical user interfaces This study suggests that conventional us-ability metrics may not be best suited to assessing chatbot usabilityand that metrics will be most effective if they measure aspects ofusability that are more closely related to chatbots Chatbot usabilitytesting may also potentially require a greater number of users thansuggested by previous research in order to maximize capture of us-ability issues Finally this study suggests that users can potentiallyreach optimum proficiency with chatbots very quickly

ACKNOWLEDGMENTSThis research study is part of a PhD research project at the UlsterUniversity funded by the Department for the Economy NorthernIreland

REFERENCES[1] Aaron Bangor Philip T Kortum and James T Miller 2008 An Empirical Evalua-

tion of the System Usability Scale International Journal of Human Computer In-teraction 24 6 (0729 2008) 574ndash594 httpsdoiorg10108010447310802205776doi 10108010447310802205776

[2] Aaron Bangor Philip T Kortum and James T Miller 2009 Determining WhatIndividual SUS Scores Mean Adding an Adjective Rating Scale Journal ofUsability Studies 4 3 (2009) 114ndash123

[3] Azy Barak Britt Klein and Judith G Proudfoot 2009 Defining Internet-Supported Therapeutic Interventions Annals of Behavioral Medicine 38 1 (08012009) 4ndash17 httpsdoiorg101007s12160-009-9130-7 ID Barak2009

[4] John Brooke 1996 SUS A rsquoquick and dirtyrsquo usability scale CRC Press 189ndash194[5] Raluca Budiu 2018 The User Experience of Chatbots[6] Gillian Cameron David Cameron Gavin Megaw Raymond Bond Maurice Mul-

venna Siobhan OrsquoNeill Cherie Armour and Michael McTear 2017 Towards aChatbot for Digital Counselling In Proceedings of the 31st British Computer SocietyHuman Computer Interaction Conference (HCI rsquo17) BCS Learning amp DevelopmentLtd Swindon UK 241ndash247 httpsdoiorg1014236ewicHCI201724

[7] Gillian Cameron David Cameron Gavin Megaw RR Bond Maurice MulvennaSiobhan OrsquoNeill C Armour and Michael McTear 2018 Back to the Future Lessonsfrom Knowledge Engineering Methodologies for Chatbot Design and Developmenthttpsdoiorg1014236ewicHCI2018153

[8] Gillian Cameron David Cameron Gavin Megaw RR Bond Maurice MulvennaSiobhan OrsquoNeill C Armour andMichael McTear 2018 Best practices for designingchatbots in mental healthcare acircĂŞ A case study on iHelpr httpsdoiorg1014236ewicHCI2018129

[9] E L Donaldson S Fallows and M Morris 2014 A text message based weightmanagement intervention for overweight adults Journal of Human Nutrition ampDietetics 27 Suppl 2 (Apr 2014) 90ndash97 httpovidspovidcomathensovidwebcgiT=JSampCSC=YampNEWS=NampPAGE=fulltextampD=medlampAN=23738786

[10] AH Fadil and S Gabrielli 2017 Addressing Challenges in Promoting HealthyLifestyles The AI-Chatbot Approach

[11] Brianna S Fjeldsoe Ana D Goode Philayrath Phongsavan Adrian BaumanGenevieve Maher Elisabeth Winkler and Elizabeth G Eakin 2016 Evaluat-ing the Maintenance of Lifestyle Changes in a Randomized Controlled Trialof the rsquoGet Healthy Stay Healthyrsquo Program JMIR MHealth and UHealth 4 2(2016) e42 httpovidspovidcomathensovidwebcgiT=JSampCSC=YampNEWS=NampPAGE=fulltextampD=premampAN=27166643

[12] Aideen Gibson Claire McCauley Maurice Mulvenna Assumpta Ryan Liz LairdKevin Curran Brendan Bunting Finola Ferry and Raymond Bond 2016 As-sessing Usability Testing for People Living with Dementia In Proceedings of the4th Workshop on ICTs for Improving Patients Rehabilitation Research Techniques(REHAB rsquo16) ACM New York NY USA 25ndash31 httpsdoiorg10114530514883051492

[13] Samuel Holmes Anne Moorhead RR Bond Huiru Zheng Vivien Coates andMichael McTear 2018 A New Automated Chatbot for Weight Loss Mainte-nance In Proceedings of the 32nd International BCS Human Computer InteractionConference (HCI 2018) httpsdoiorg1014236ewicHCI2018103

[14] A Baki Kocaballi L Laranjo and E Coiera 2018 Measuring User Experience inConversational Interfaces A Comparison of Six Questionnaires In Proceedings ofthe 32nd International BCS Human Computer Interaction Conference (HCI 2018)

[15] Bettina Laugwitz Theo Held and Martin Schrepp 2008 Construction andEvaluation of a User Experience Questionnaire In HCI and Usability for Educationand Work 4th Symposium of the Workgroup Human-Computer Interaction andUsability Engineering of the Austrian Computer Society USAB Vol 5298 63ndash76httpsdoiorg101007978-3-540-89350-9_6

[16] J MartAtildeŋn C MuAtildeśoz-Romero and N AtildeĄbalos 2017 chatbottest - Improveyour chatbotrsquos design

[17] Robert J Moore Raphael Arar Guang-Jie Ren and Margaret H Szymanski 2017Conversational UX Design In Proceedings of the 2017 CHI Conference ExtendedAbstracts on Human Factors in Computing Systems (CHI EA rsquo17) ACM New YorkNY USA 492ndash497 httpsdoiorg10114530270633027077

[18] J Nielsen 1994 10 Heuristics for User Interface design[19] Jakob Nielsen and Thomas K Landauer 1993 A Mathematical Model of the

Finding of Usability Problems In Proceedings of the INTERCHI rsquo93 Conference onHuman Factors in Computing Systems (INTERCHI rsquo93) IOS Press AmsterdamThe Netherlands The Netherlands 206ndash213 httpdlacmorgcitationcfmid=164632164904

[20] Wharton University of Pennsylvania 2016 The Rise of the Chatbots Is It Timeto Embrace Them

[21] Ofcom 2017 The UK Communications Market Telecoms amp Networks TechnicalReport Ofcom

[22] Kelly Pedotto Vivey Chen and JP McElyea 2017 The 2017 US Mobile AppReport

[23] Fred F Reichheld 2003 The One Number You Need to Grow[24] Jeff Sauro 2012 Predicting Net Promoter scores from System Usability Scale

Scores[25] Jeff Sauro 2018 5 ways to interpret a SUS score[26] Martin Schrepp 2019 User Experience Questionnaire Data Analysis Tool[27] B Shneiderman C Plaisant M Cohen S Jacobs and N Elmqvist 2016 Designing

the User Interface Strategies for Effective Human-Computer Interaction (sixthedition ed) Pearson

[28] T Tullis and B Albert 2018 Measuring the User Experience SUS CalculationTool

[29] UX247 2018 Chatbots Usability Testing amp Chatbots

  • Abstract
  • 1 Introduction
    • 11 Chatbot UX Design
    • 12 Testing Usability of Chatbots
    • 13 Chatbots in Healthcare
    • 14 The WeightMentor Chatbot
      • 2 Research Questions
      • 3 Methods
        • 31 Research Design
        • 32 Chatbot Development
        • 33 Test Protocol
        • 34 Usability Metrics
        • 35 Questionnaire Analysis
        • 36 Data Analysis amp Presentation Tools
          • 4 Results
            • 41 WeightMentor Chatbot Usability
            • 42 Usability Questionnaire Correlation
            • 43 Chatbot Usability Issues
            • 44 Chatbot Task Completion
              • 5 Discussion
              • 6 Conclusion
              • Acknowledgments
              • References
Page 4: Usability testing of a healthcare chatbot: Can we use ... · bers for identifying usability issues may not apply to chatbots. The study design was a usability study. WeightMentor

ECCE rsquo19 September 10ndash13 2019 Belfast UK Holmes Moorhead Bond Zheng Coates amp McTear

usability issues and the differential between pre-task SEQ answersand post-task SEQ answers

341 System Usability Scale (SUS) SUS was designed as a quickand easy means of assessing usability [4] and today is one of themost commonly used usability testing tools SUS is comprised often validated statements covering five positive aspects and fivenegative aspects of the system Participants score each questionout of five Final scores are out of 100 and may be compared withthe SUS benchmark (currently 680 representing an average score)or interpreted using several possible scales

SUS scores can be grouped into percentile ranges [25] Alterna-tively a grade system may be used Bangor et al (2009) proposean adjective based scale ranging from Worst Imaginable (SUSgrades 0 - 25) to Best Imaginable (SUS grades over 841) [2] TheAcceptability Scale [1] ranges from Not Acceptable (SUS lt 626)to Acceptable (SUS gt 711) Finally SUS scores may be linked withthe Net Promoter Score (NPS) designed for measuring customerloyalty [23] SUS scores greater than 788 may be classed as Pro-moters below 627 will be Detractors and scores in between arePassive [24]

The benchmark SUS score falls within the 41st - 59th percentileIt is grade C Good on the adjective scale Marginal on the ac-ceptability scale and NPC classification Passive

342 User Experience Questionnaire (UEQ) The UEQ serves as ameans of comprehensively assessing the UX [15] It is based onsix scales summarised in Table 3 Scales are measured using pairsof opposite adjectives to describe the system with participantsselecting their level of agreement with each UEQ scores assess theextent to which the system meets expectations but more usefullymay be compared with a benchmark to determine how the systemunder test compares to other systems

Table 3 UEQ Scales

Scale What does it measure

Attractiveness Extent to which users like the systemPerspicuity Ease of learning and becoming proficient

in the systemEfficiency Effort required to complete tasks System

reaction timesDependability Extent of user control System predictabil-

itysecurityStimulation How funexciting is it to use the systemNovelty CreativenessInterest for users

343 Chatbot Usability Questionnaire (CUQ) The CUQ is basedon the chatbot UX principles provided by the ALMA Chatbot Testtool [16] which assess personality onboarding navigation under-standing responses error handling and intelligence of a chatbotThe CUQ is designed to be comparable to SUS except it is bespokefor chatbots and includes 16 items Participantsrsquo levels of agreementwith sixteen statements relating to positive and negative aspectsof the chatbot are ranked out of five from Strongly Disagree toStrongly Agree Statements used in the CUQ are listed in Table 4

Table 4 Statements used in the novel and bespoke but rsquoun-validatedrsquo Chatbot Usability Questionnaire (CUQ)

QuestionNumber

Question

1 The chatbotrsquos personality was realistic and en-gaging

2 The chatbot seemed too robotic3 The chatbot was welcoming during initial setup4 The chatbot seemed very unfriendly5 The chatbot explained its scope and purpose well6 The chatbot gave no indication as to its purpose7 The chatbot was easy to navigate8 It would be easy to get confused when using the

chatbot9 The chatbot understood me well10 The chatbot failed to recognise a lot of my inputs11 Chatbot responses were useful appropriate and

informative12 Chatbot responses were not relevant13 The chatbot coped well with any errors or mis-

takes14 The chatbot seemed unable to handle any errors15 The chatbot was very easy to use16 The chatbot was very complex

35 Questionnaire Analysis351 SUS Calculation The SUS score calculation spreadsheet [28]was used to calculate SUS scores out of 100 The formula for thiscalculation is shown in equation 1

SUS =1n

nsumi=1

normmsumj=1

qi jminus1qi jmod2gt0

5minusqi j otherwise

(1)

where n=number of subjects (questionnaires) m=10 (numberof questions) qij=individual score per question per participantnorm=25

352 UEQ Calculation The UEQ Data Analysis Tool [26] analysesquestionnaire data and presents results graphically By default theUEQ does not generate a single score for each participant but insteadprovides six scores one for each attribute To facilitate correlationanalysis using the UEQ a acircĂIJmean UEQ scoreacircĂİ from the sixscores was calculated for each participant The mean UEQ score isthe mean of the scores for all six scales of the UEQ per participant

353 CUQ Calculation CUQ scores were calculated out of 160using the formula in equation 2 and then normalized to give a scoreout of 100 to permit comparison with SUS

CUQ =

(( msumn=1

2n minus 1

)minus 5

)+

(25 minus

( msumn=1

2n

))times 16 (2)

where m = 16 (number of questions) and n = individual questionscore per participant

Chatbots Can conventional methods be used to test usability ECCE rsquo19 September 10ndash13 2019 Belfast UK

36 Data Analysis amp Presentation ToolsR Studio and Microsoft Excel were used for data analysis Box plotswere produced using R Studio and barline plots were producedusing excel Statistical tests such as correlation tests and t-testswere conducted in R Studio Correlation analysis was conductedusing Pearson correlation coefficient Mean task completion timeswere compared for significance using paired Wilcoxon analysisThe p-value threshold for statistical significance was 5 (or 005)

4 RESULTS41 WeightMentor Chatbot Usability411 System Usability Scale A total of 30 participants (healthyadults) were recruited andwho evaluated the usability of theWeight-Mentor chatbot A boxplot ofWeightMentor SUS scores is shownin Figure 2 The meanWeightMentor SUS score was 8483 plusmn 1203and the median was 8625 The highest score was 1000 and thelowest was 5750 The meanWeightMentor score places the chatbotin the 96th-100th percentile range equivalent to Grade A+ BestImaginable Acceptable and NPS Class Promoter on the variousscales discussed in methods Hence the chatbot has a high degreeof usability according to traditional usability SUS scores HoweverSUS distributions for benchmarking have not included usabilityscores from testing chatbots

Figure 2 WeightMentor SUS scores (Benchmark of 680 ismarked)

412 User Experience Questionnaire The chatbot scored highly inall UEQ scales Scores werewell above benchmark and are presentedgraphically in Figure 3 Participant scores for each scale were allabove +08 suggesting that in general participants were satisfiedwith theWeightMentor user experience

Figure 3WeightMentor UEQ scores against benchmark

413 Chatbot UsabilityQuestionnaire A box plot ofWeightMentorCUQ scores is shown in Figure 4 The mean score was 7620plusmn 1146and the median was 765 The highest score was 1000 and lowestwas 484

Figure 4WeightMentor CUQ scores

42 Usability Questionnaire CorrelationCorrelations between each usability survey score and scatter plotsare shown in Figure 5 All correlations are statistically significantsince the p-values are all below 005 (5) Multiple Regression wasused to determine if CUQ score can be determined by SUS andMean UEQ score Results are shown in Table 5

43 Chatbot Usability IssuesThirty participants identified fifty-three usability issues Usabilityissues were selected based on think-aloud data and feedback fromparticipants in the Post-Test Survey Usability issues identified perparticipant were listed in a spreadsheet in the chronological orderin which the tests were conducted All usability issues identifiedby the first participant were treated as unique and usability issuesidentified by subsequent participants were considered unique ifthey had not been identified by a previous participant All uniqueusability issues identified at each test are plotted as a line graph inFigure 6

To determine the best-case scenario participants were re-sortedin a spreadsheet in descending order based on the number of us-ability issues identified per participant Where more than one par-ticipant identified the same number of issues these were sorted inchronological order of participants Usability issues identified bythe first participant were treated as unique and subsequent issueswere counted only if they had not yet been identified by previousparticipants Identified usability issues were plotted against numberof participants as a line graph shown in Figure 6 In this scenariothe first participant identifies the most usability issues and the lastparticipant identifies the least This represents the best case (iethe least number of subjects required to identify almost all of theusability issues)

To determine the worst-case scenario participants for best casescenario were sorted in reverse order and usability issues wereidentified as for previous line graphs The graph of this scenariois also shown in Figure 6 where the first participant identified thefewest usability issues whilst the last participant identified themost usability issues

ECCE rsquo19 September 10ndash13 2019 Belfast UK Holmes Moorhead Bond Zheng Coates amp McTear

Figure 5 Scatter plots of (a) SUS and CUQ scores (b) SUS and UEQ mean scores (c) CUQ and UEQ mean scores

Table 5 Multiple regression results where SUS and UEQ are independent variables and CUQ is the dependent variable

Coefficients Estimate Std Error t-value p-value

Intercept 360128 85217 4226 0000243SUS 02578 01307 1973 0058854UEQ 103694 21264 4876 00000425Residual Standard Error 5253 on 27 degrees of freedomMultiple R-Squared 08046 Adjusted R-Squared 07901F-statistic 5558 on 2 and 27 DF p-value 00000000002681

44 Chatbot Task CompletionDuring the usability tests participantsacircĂŹ interactions with thechatbot were video recorded Task completion times were calculatedfor each individual participant along with the mean time overallA benchmark task completion time was established by recordingthe performance of the developer (SH) Two tasks (task 2 and task3) were repeated four times by participants to determine if taskcompletion times improved with each repetition Bar plots of taskcompletion times against the benchmark are shown in Figure 7The p-values for repeated tasks are shown in Table 6

5 DISCUSSIONWeightMentor mean SUS score places the chatbot at the top of thevarious scales discussed in section 331 above This suggests thatin comparison to conventional systems WeightMentor is highlyusable Similarly WeightMentor UEQ scores were favourable whencompared to UEQ benchmark However SUS and UEQ benchmarkscores are derived from tests of conventional systems such as web-sites therefore do not include any chatbot scores Thus whileWeightMentor may score highly using these metrics it is impossi-ble to determine the accuracy of this score relative to other chatbots

Chatbots Can conventional methods be used to test usability ECCE rsquo19 September 10ndash13 2019 Belfast UK

Figure 6 Usability issues identified against number of par-ticipants

Figure 7 Task completion times against benchmark (a) pertask with repetitions (b) per task overall

Correlation between the three main questionnaires was generally

Table 6 Task completion time p-values

Comparison p-value

Task 2 Attempt 1 amp Attempt 2 0002621Attempt 2 amp Attempt 3 03457Attempt 3 amp Attempt 4 08284

Task 3 Attempt 1 amp Attempt 2 006736Attempt 2 amp Attempt 3 09249Attempt 3 amp Attempt 4 01031

strong and was highest between the CUQ and UEQ Multiple Re-gression results suggest that 80 of the variance in CUQ score canbe explained by SUS and UEQ Thus 20 of the variance in CUQscore is explained by other factors ie CUQ is perhaps measuringconstructs more closely related to chatbots that is not being mea-sured by SUS and UEQ The p-value for SUS is greater than 005(5) thus it is not statistically significant in predicting CUQ scoresand the p-value for UEQ is less than 005 hence this is statisticallysignificant in predicting CUQ score This indicates that UEQ hasa greater influence on CUQ score than SUS which suggests thatUEQ and CUQ are measuring similar aspects of chatbot usabilityand UX This finding can be rationalized given that UEQ is morecomplex when compared to SUS The implication of these findingsis that SUS may be less effective for measuring chatbot usability onits own as it may not measure aspects of UX that make chatbotsparticularly usable

Analysis of identified usability issues suggested that in the best-case scenario the optimum number of users is 21 (since the functionin figure 6 plateaus thereafter) In the worst-case scenario theoptimum number of users is 30 In chronological order the optimumnumber was 26 - 29 users The mean number of users required toidentify most of the usability issues in the chatbot is 26 Nielsen ampLandauer (1993) determined that 80 of unique usability issues maybe captured by no more than 5 to 8 users however this researchconcerned conventional systems not chatbots and it may be thecase that the nature of chatbots makes it more difficult to identifyusability issues with a smaller number of participants [19]

In general task completion times did improve with each repeti-tion of a task however while the time difference was statisticallysignificant between the first attempt and the second attempt at task2 it was not statistically significant between the second and thirdand third and fourth attempts Similarly time differences were notstatistically significant between any of the attempts at task 3 whichmay be because task 3 was very similar in procedure to task 2 Thissuggests that it may be possible for users to become proficient witha new chatbot very quickly owing to their simplicity and ease ofuse

6 CONCLUSIONChatbot UX design and usability testing may require nontraditionalmethods and multiple metrics are likely to provide a more com-prehensive picture of chatbot usability The WeightMentor chatbotscored highly on both SUS and UEQ scales and correlation anal-yses suggest that correlation is stronger between CUQ and UEQthan CUQ and SUS and validation of the CUQ will increase its

ECCE rsquo19 September 10ndash13 2019 Belfast UK Holmes Moorhead Bond Zheng Coates amp McTear

effectiveness as a usability analysis tool Given that the variance ofCUQ scores are not completely explained by SUS and UEQ there isan argument that these traditional usability testing surveys do notevaluate all aspects of a chatbot interface This study also suggeststhat approximately 26 subjects are required to identify almost allthe usability issues in the chatbot which challenges previous re-search Whilst primitive this work also suggests that users becomeoptimal after just one attempt of a task when using a chatbot Thiscould be explained by the fact that chatbots can be less complex inthat they lack visual hierarchy and complexity as seen in normallygraphical user interfaces This study suggests that conventional us-ability metrics may not be best suited to assessing chatbot usabilityand that metrics will be most effective if they measure aspects ofusability that are more closely related to chatbots Chatbot usabilitytesting may also potentially require a greater number of users thansuggested by previous research in order to maximize capture of us-ability issues Finally this study suggests that users can potentiallyreach optimum proficiency with chatbots very quickly

ACKNOWLEDGMENTSThis research study is part of a PhD research project at the UlsterUniversity funded by the Department for the Economy NorthernIreland

REFERENCES
[1] Aaron Bangor, Philip T. Kortum, and James T. Miller. 2008. An Empirical Evaluation of the System Usability Scale. International Journal of Human-Computer Interaction 24, 6 (2008), 574–594. https://doi.org/10.1080/10447310802205776
[2] Aaron Bangor, Philip T. Kortum, and James T. Miller. 2009. Determining What Individual SUS Scores Mean: Adding an Adjective Rating Scale. Journal of Usability Studies 4, 3 (2009), 114–123.
[3] Azy Barak, Britt Klein, and Judith G. Proudfoot. 2009. Defining Internet-Supported Therapeutic Interventions. Annals of Behavioral Medicine 38, 1 (2009), 4–17. https://doi.org/10.1007/s12160-009-9130-7
[4] John Brooke. 1996. SUS: A 'quick and dirty' usability scale. CRC Press, 189–194.
[5] Raluca Budiu. 2018. The User Experience of Chatbots.
[6] Gillian Cameron, David Cameron, Gavin Megaw, Raymond Bond, Maurice Mulvenna, Siobhan O'Neill, Cherie Armour, and Michael McTear. 2017. Towards a Chatbot for Digital Counselling. In Proceedings of the 31st British Computer Society Human Computer Interaction Conference (HCI '17). BCS Learning & Development Ltd, Swindon, UK, 241–247. https://doi.org/10.14236/ewic/HCI2017.24
[7] Gillian Cameron, David Cameron, Gavin Megaw, Raymond Bond, Maurice Mulvenna, Siobhan O'Neill, Cherie Armour, and Michael McTear. 2018. Back to the Future: Lessons from Knowledge Engineering Methodologies for Chatbot Design and Development. https://doi.org/10.14236/ewic/HCI2018.153
[8] Gillian Cameron, David Cameron, Gavin Megaw, Raymond Bond, Maurice Mulvenna, Siobhan O'Neill, Cherie Armour, and Michael McTear. 2018. Best practices for designing chatbots in mental healthcare – A case study on iHelpr. https://doi.org/10.14236/ewic/HCI2018.129
[9] E. L. Donaldson, S. Fallows, and M. Morris. 2014. A text message based weight management intervention for overweight adults. Journal of Human Nutrition & Dietetics 27, Suppl 2 (Apr 2014), 90–97.
[10] A. H. Fadil and S. Gabrielli. 2017. Addressing Challenges in Promoting Healthy Lifestyles: The AI-Chatbot Approach.
[11] Brianna S. Fjeldsoe, Ana D. Goode, Philayrath Phongsavan, Adrian Bauman, Genevieve Maher, Elisabeth Winkler, and Elizabeth G. Eakin. 2016. Evaluating the Maintenance of Lifestyle Changes in a Randomized Controlled Trial of the 'Get Healthy, Stay Healthy' Program. JMIR mHealth and uHealth 4, 2 (2016), e42.
[12] Aideen Gibson, Claire McCauley, Maurice Mulvenna, Assumpta Ryan, Liz Laird, Kevin Curran, Brendan Bunting, Finola Ferry, and Raymond Bond. 2016. Assessing Usability Testing for People Living with Dementia. In Proceedings of the 4th Workshop on ICTs for Improving Patients Rehabilitation Research Techniques (REHAB '16). ACM, New York, NY, USA, 25–31. https://doi.org/10.1145/3051488.3051492
[13] Samuel Holmes, Anne Moorhead, Raymond Bond, Huiru Zheng, Vivien Coates, and Michael McTear. 2018. A New Automated Chatbot for Weight Loss Maintenance. In Proceedings of the 32nd International BCS Human Computer Interaction Conference (HCI 2018). https://doi.org/10.14236/ewic/HCI2018.103
[14] A. Baki Kocaballi, L. Laranjo, and E. Coiera. 2018. Measuring User Experience in Conversational Interfaces: A Comparison of Six Questionnaires. In Proceedings of the 32nd International BCS Human Computer Interaction Conference (HCI 2018).
[15] Bettina Laugwitz, Theo Held, and Martin Schrepp. 2008. Construction and Evaluation of a User Experience Questionnaire. In HCI and Usability for Education and Work: 4th Symposium of the Workgroup Human-Computer Interaction and Usability Engineering of the Austrian Computer Society (USAB), Vol. 5298. 63–76. https://doi.org/10.1007/978-3-540-89350-9_6
[16] J. Martín, C. Muñoz-Romero, and N. Ábalos. 2017. chatbottest - Improve your chatbot's design.
[17] Robert J. Moore, Raphael Arar, Guang-Jie Ren, and Margaret H. Szymanski. 2017. Conversational UX Design. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA '17). ACM, New York, NY, USA, 492–497. https://doi.org/10.1145/3027063.3027077
[18] J. Nielsen. 1994. 10 Heuristics for User Interface Design.
[19] Jakob Nielsen and Thomas K. Landauer. 1993. A Mathematical Model of the Finding of Usability Problems. In Proceedings of the INTERCHI '93 Conference on Human Factors in Computing Systems (INTERCHI '93). IOS Press, Amsterdam, The Netherlands, 206–213. http://dl.acm.org/citation.cfm?id=164632.164904
[20] Wharton, University of Pennsylvania. 2016. The Rise of the Chatbots: Is It Time to Embrace Them?
[21] Ofcom. 2017. The UK Communications Market: Telecoms & Networks. Technical Report. Ofcom.
[22] Kelly Pedotto, Vivey Chen, and J. P. McElyea. 2017. The 2017 US Mobile App Report.
[23] Fred F. Reichheld. 2003. The One Number You Need to Grow.
[24] Jeff Sauro. 2012. Predicting Net Promoter Scores from System Usability Scale Scores.
[25] Jeff Sauro. 2018. 5 Ways to Interpret a SUS Score.
[26] Martin Schrepp. 2019. User Experience Questionnaire Data Analysis Tool.
[27] B. Shneiderman, C. Plaisant, M. Cohen, S. Jacobs, and N. Elmqvist. 2016. Designing the User Interface: Strategies for Effective Human-Computer Interaction (sixth ed.). Pearson.
[28] T. Tullis and B. Albert. 2018. Measuring the User Experience: SUS Calculation Tool.
[29] UX247. 2018. Chatbots: Usability Testing & Chatbots.
