DO YOU REALLY NEED TO TEST WITH ONLY 5 USERS?

PAPERS WE LOVE Seoul Chapter

1992

ROBERT A. VIRZI

PHOTO CREDIT: https://www.researchgate.net/profile/Robert_Virzi

Technical Lead at GTE Laboratories Inc (Verizon)

https://www.researchgate.net/profile/Robert_Virzi

01 Where did the magic number 5 come from, and why is it important?

DESIGNING FOR USABILITY: JOHN D. GOULD and ...

1986

EARLY FOCUS ON USERS & TASKS

EMPIRICAL MEASUREMENTS

ITERATIVE DESIGN

Designing for Usability

EXPENSIVE

TIME CONSUMING

Cost?

@seoul_victoria

Magic Numbers

“YOU NEED TO TEST ONLY WITH x5 / x3 USERS”

Beginning 1981-1994

1981
Alphonse Chapanis
“Observing about five to six users reveals most of the problems in a usability test”
http://www.measuringu.com/blog/five-history.php

1982
Dr. James R. (Jim) Lewis
“Suggested Binomial Distribution to model the sample size needed to find usability problems.”

1990-92
Robert Virzi
“Five users are enough to find the majority of usability problems.”

Rebellion 2001-2006

Clarifications 2006~

2006
Carl Turner, Jim Lewis and Jakob Nielsen
Reviews the criticisms of the sample size formulas but shows how they can and should be legitimately used.

2006
Jim Lewis
Provides a detailed history of how we find sample sizes using "mostly math, not magic."

2010
Jeff Sauro (MeasuringU)
Why You Only Need To Test With Five Users

These Days

02 Paper Review

Initial Motivation

Jakob Nielsen, “Evaluating the thinking-aloud technique for use by computer scientists.” 1990

3 SUBJECTS: 49% of USABILITY PROBLEMS

1992

Experiment 1: “Replicating Nielsen's Experiment”

x12 SUBJECTS, x2 EXPERIMENTERS

Experiment method

Voice Mail System Manual

PHOTO CREDIT: Security and Usability by Simson Garfinkel, Lorrie Faith Cranor

Result

13 USABILITY PROBLEMS

Problems Identified per Subject

8 of 13 USABILITY PROBLEMS (62%)

2 of 13 USABILITY PROBLEMS (15%)

Subjects Uncovering Each Problem

1 USABILITY PROBLEM: 10 of 12 SUBJECTS (83%)

1 USABILITY PROBLEM: 1 of 12 SUBJECTS (8%)

Binomial Distribution Formula

$1 - (1 - p)^n$

p - probability of detecting a given problem
n - the sample size
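A minimal sketch of the formula, assuming every problem shares the same detection probability p and that subjects are independent. The function name `discovery_rate` is illustrative and not taken from the paper; 0.32 is the p reported for Experiment 1 in this deck.

```python
def discovery_rate(p: float, n: int) -> float:
    """Expected share of problems found by n subjects: 1 - (1 - p)^n."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# Print the expected share found as subjects are added, for p = 0.32.
for n in range(1, 6):
    print(f"{n} subjects: {discovery_rate(0.32, n):.1%}")
```

With p = 0.32 the curve passes roughly 69% at three subjects and 85% at five, which is the shape the following slides discuss.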

Result

p=0.32

65% versus 49% (Nielsen)

OUTPUT

Five subjects are needed to find 80% of the usability problems

OUTPUT

Diminishing returns: later subjects are not as likely to uncover new usability problems as are earlier ones.
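The diminishing returns can be made concrete under the same binomial model. This is a sketch under the assumption of a single shared p; `marginal_gain` is a hypothetical helper name, and it gives the expected share of problems that subject n is the first to uncover.

```python
def marginal_gain(p: float, n: int) -> float:
    """Expected share of problems first found by subject n (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Share still unseen after n - 1 subjects, times the chance subject n detects it.
    return (1.0 - p) ** (n - 1) * p


gains = [marginal_gain(0.32, n) for n in range(1, 9)]
for n, gain in enumerate(gains, start=1):
    print(f"subject {n}: +{gain:.1%}")
```

For p = 0.32 the first subject contributes 32% of the problems, while the fifth adds only about 7%.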

What did we learn?

x5 SUBJECTS: 80% of USABILITY PROBLEMS

But what if…

A SEVERE PROBLEM WILL BE MISSED?

Experiment 2: “Does the proportion of problems detected vary as a function of problem severity?”

x3

Experiment method

Computer-based appointment calendar

Result

40 USABILITY PROBLEMS

OUTPUT

80% of all the usability problems were found after five subjects were run.
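One way to estimate a "problems found after k subjects" figure from observed sessions is to average over many random subject orderings. The sketch below illustrates that general resampling idea and is not a reproduction of the paper's procedure; the function name and the detection data are made up for illustration.

```python
import random


def mean_discovery_curve(detections, trials=1000, seed=0):
    """Average share of all observed problems found by the first k subjects.

    detections[s] is the set of problem ids that subject s uncovered.
    """
    rng = random.Random(seed)
    all_problems = set().union(*detections)
    n_subjects = len(detections)
    totals = [0.0] * n_subjects
    for _ in range(trials):
        order = rng.sample(range(n_subjects), n_subjects)  # random subject order
        found = set()
        for position, subject in enumerate(order):
            found |= detections[subject]
            totals[position] += len(found) / len(all_problems)
    return [total / trials for total in totals]


# Hypothetical data: six subjects, problems labelled A-F (not from the paper).
observed = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "D"},
    {"B", "C", "E"},
    {"A", "C"},
    {"A", "B", "F"},
]
curve = mean_discovery_curve(observed)
for k, share in enumerate(curve, start=1):
    print(f"first {k} subjects: {share:.0%}")
```

Averaging over orderings removes the accident of which subject happened to be run first, so the curve depends only on who found what.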

Severity Evaluation

OUTPUT

The more severe a problem is, the more likely it will be uncovered within the first few subjects.

What did we learn?

x5 SUBJECTS: 80% of USABILITY PROBLEMS

MORE SEVERE PROBLEMS WILL BE UNCOVERED FIRST

But what if?

x3 EXPERIMENTERS INFLUENCED BY THE KNOWLEDGE OF PROBLEM FREQUENCY

Experiment 3: “Can experts make judgments of problem severity without access to frequency information?”

x6


x2

EXPERTS

Voice Responsive System

Result

17 USABILITY PROBLEMS

OUTPUT

5 subjects tended to find about 85% of the usability problems

Agreement Among Judges

EXPERIMENTERS and EXPERTS: 17 USABILITY PROBLEMS

Assigned Ratings

High Medium Low

OUTPUT

As problem severity increases, the likelihood that the problem will be detected within the first few subjects also increases.

Agreement Among Experts and Experimenters

Kendall's Coefficient of Concordance: W(16) = 0.471

OUTPUT

Experts can judge problem severity without frequency information
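For readers unfamiliar with Kendall's Coefficient of Concordance, here is a rough sketch of how W can be computed from a judges-by-items table of ratings. It uses the basic formula W = 12S / (m^2 (n^3 - n)) with average ranks for ties and no tie correction, which is an assumption of this sketch (the paper's exact computation is not shown in the slides). The function names and the example ratings are hypothetical.

```python
def average_ranks(scores):
    """Rank scores from 1..n, giving tied scores the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(ratings):
    """ratings[j][i] is judge j's score for item i. No tie correction applied."""
    m = len(ratings)
    n = len(ratings[0])
    rank_rows = [average_ranks(row) for row in ratings]
    rank_sums = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical severity ratings (1 = Low, 2 = Medium, 3 = High) for five problems.
example = [
    [3, 2, 1, 3, 1],
    [3, 1, 1, 2, 2],
    [2, 2, 1, 3, 1],
]
print(f"W = {kendalls_w(example):.3f}")
```

W ranges from 0 (no agreement) to 1 (identical rankings). With a three-level scale there are many ties, so the uncorrected value here will understate agreement somewhat.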

03 Overall Discussion

1. The first 5 users find 85% of problems in a usability test

2. Additional subjects are less and less likely to reveal new information

3. Severe problems are more likely to be detected by the first few users

4. Experts can judge problems’ severity without access to frequency information

Formula for Sample Size

$1 - (1 - p)^n$

Is it that simple?

x5 SUBJECTS: 80% of USABILITY PROBLEMS

The problems you will find affect 32% of users

$1 - (1 - p)^n$

p=0.32

What p to use?

New Application: 32%
Already Released Application: 10%
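To see why the choice of p matters, the two values on this slide can be plugged into the binomial formula for five subjects. This is a small sketch assuming a single shared p per application; `share_found` is an illustrative name.

```python
def share_found(p: float, n: int) -> float:
    """Expected share of problems found by n subjects, each detecting with probability p."""
    return 1.0 - (1.0 - p) ** n


# p values taken from the slide: 32% (new application), 10% (already released).
for label, p in [("new application", 0.32), ("already released application", 0.10)]:
    print(f"{label}: p={p:.2f}, five subjects find about {share_found(p, 5):.0%}")
```

Five subjects are expected to find about 85% of the problems when p = 0.32, but only about 41% when p = 0.10, so the "five users" rule depends heavily on how common the problems are.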

04 Calculating Sample Size

How many subjects are required to identify any problem experienced by 10% or more of the population at the 90% confidence level?


$1 - (1 - p)^n$

p - probability of detecting a given problem
n - the sample size

$0.9 = 1 - (1 - p)^n$

$0.9 = 1 - (1 - 0.1)^n$

$0.1 = (0.9)^n$

$\log(0.1) = n \log(0.9)$

$n = \log(0.1) / \log(0.9)$

$n = 21.85$

Determine # of Subjects

22 users are needed to have a 90% likelihood of detecting a problem that will be experienced by 10% of people

$n = \log(1 - \text{Chance of Detecting}) / \log(1 - \text{Probability of Occurring})$
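The general formula translates directly into a small helper. This is a sketch, with the function and parameter names chosen here to mirror the slide's wording; the result is rounded up because a fractional subject cannot be run.

```python
import math


def subjects_needed(chance_of_detecting: float, probability_of_occurring: float) -> int:
    """Round n = log(1 - chance) / log(1 - p) up to a whole number of subjects."""
    if not 0.0 < chance_of_detecting < 1.0:
        raise ValueError("chance_of_detecting must be strictly between 0 and 1")
    if not 0.0 < probability_of_occurring < 1.0:
        raise ValueError("probability_of_occurring must be strictly between 0 and 1")
    n = math.log(1.0 - chance_of_detecting) / math.log(1.0 - probability_of_occurring)
    return math.ceil(n)


# The worked example above: 21.85 rounds up to 22 subjects.
print(subjects_needed(0.90, 0.10))
```

The same helper gives 5 subjects for an 80% chance of detecting problems with p = 0.32, matching the earlier slides.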

05 Reading List

• A Brief History Of The Magic Number 5 In Usability Testing http://www.measuringu.com/blog/five-history.php

• Al-Awar, J., Chapanis, A., and Ford, R. (1981). Tutorials for the first-time computer user. IEEE Transactions on Professional Communication, 24, 30-37.

• Lewis, J. R. (1982). Testing Small System Customer Setup. In Proceedings of the Human Factors Society 26th Annual Meeting (pp. 718-720). Santa Monica, CA: HFES.

• Virzi, R. A. (1992). Refining the test phase of usability evaluation: How many subjects is enough? Human Factors, 34, 457-471.



• Nielsen, J., & Landauer, T. K. (1993). A mathematical model of the finding of usability problems. In Proceedings of the SIGCHI conference on Human factors in computing systems (pp.206-213). Amsterdam: ACM.

• Lewis, J. R. (1993). Problem discovery in usability studies: A model based on the binomial probability formula. In Proceedings of the Fifth International Conference on Human-Computer Interaction (pp. 666-671). Orlando, FL: Elsevier.

• Lewis, J. R. (1994). Sample sizes for usability studies: Additional considerations. Human Factors, 36, 368-378.

• Caulton, D. A. (2001). Relaxing the homogeneity assumption in usability testing. Behaviour & Information Technology, 20, 1-7.

• Spool J., & Schroeder W. (2001). Testing web sites: five users is nowhere near enough, CHI '01 extended abstracts on Human factors in computing systems, March 31-April 05, Seattle, Washington.

• Perfetti, C., & Landesman, L. (2001). Eight is not enough. Retrieved July 15, 2010 from

• Turner, C. W., Lewis, J. R., & Nielsen, J. (2002). UPA Panel: How many users is enough? Determining usability test sample size

• Wixon, D. (2003). Evaluating usability methods: why the current literature fails the practitioner. interactions, v.10 n.4, July + August.

• Lewis, J. R., 2001, Evaluation of procedures for adjusting problem-discovery rates estimated from small samples. International Journal of Human-Computer Interaction, 13, 445-479.

• Hertzum, M. & Jacobsen, N. J. (2003 – corrected version, original published in 2001). The evaluator effect: A chilling fact about usability evaluation methods. International Journal of Human-Computer Interaction, 15, 183-204.

• Woolrych, A. & Cockton, G. (2001). Why and when five test users aren't enough. In Vanderdonckt, J., Blandford, A. and Derycke A. (eds.) Proceedings of IHM-HCI 2001 Conference, Vol. 2 (Toulouse, France: Cépadèus Éditions), pp. 105-108.

• Bevan, N., Barnum, C., Cockton, G., Nielsen, J., Spool, J., and Wixon, D. 2003. The "magic number 5": is it enough for web testing?. In CHI '03 Extended Abstracts on Human Factors in Computing Systems (Ft. Lauderdale, Florida, USA, April 05 - 10, 2003). CHI '03. ACM, New York, NY, 698-699

• Turner, C. W., Lewis, J. R., & Nielsen, J. (2006). Determining usability test sample size. In W. Karwowski (ed.), International Encyclopedia of Ergonomics and Human Factors (pp. 3084-3088). Boca Raton, FL: CRC Press.

• Lewis, J. R. (2006). Sample sizes for usability tests: mostly math, not magic. interactions 13, 6 (Nov. 2006), 29-33.

• Lindgaard, G., & Chattratichart, J. (2007). Usability testing: what have we overlooked?. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (San Jose, California, USA, April 28 - May 03, 2007). CHI '07. ACM, New York, NY, 1415-1424.

• Schmettow, M. (2008), "Heterogeneity in the Usability Evaluation Process," in Proceedings of the 22nd British HCI Group Annual Conference on HCI 2008: People and Computers XXII: Culture, Creativity, Interaction - Volume 1, ACM, Liverpool, UK, pp. 89-98.

Thank you

Discussion