Statistical methods for testing the visual quality of displays

David Travis*, Tom Stewart

System Concepts Ltd., 2 Savoy Court, Strand, London WC2R 0EZ, UK

Received 17 August 1996; revised 17 December 1996; accepted 17 December 1996

Displays 18 (1997) 29–36

0141-9382/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved. PII S0141-9382(97)00001-2

* Corresponding author. Tel.: +44 171 240 3388; fax: +44 171 240 5212.

Abstract

Although the business benefits of usability testing are widely accepted, it is still rare to find visual displays tested against user performance standards. Part of the reason for this is the assumption that, to be valid, user performance testing requires a large number of subjects and hence takes too long. In this paper we review a range of statistical procedures that allow decisions to be made on the acceptability of a display using the minimum number of subjects, and show that valid decisions can be obtained with as few as 15 subjects. © 1997 Elsevier Science B.V.

Keywords: Standards; Usability; Sequential statistics; Image quality; User performance

1. Introduction

Ergonomics has traditionally been concerned with ensuring that the design of equipment and the working environment takes sufficient account of the strengths and limitations of the people who are expected to use them. There is ample scope in today's office for such activity. Much productivity is lost in the office as a result of poorly designed equipment and inappropriate workplaces.

But by far the most widely publicised issue concerns the alleged health hazards associated with the introduction of visual display terminals (VDTs) into the office. Various risks have been identified, including eyestrain, postural problems, repetitive strain injuries, facial dermatitis, adverse pregnancy outcomes and other dramatic and frightening maladies. Many of these are unproved, but a significant number are well established and ergonomics related.

1.1. Standards as a means of improving VDT work

In recent years, there have been a number of attempts to develop national and international standards for VDT ergonomics as a means of overcoming these problems. One of the most important activities in the long term is the work of the International Organisation for Standardisation (ISO). ISO identifies four distinct sets of aims for standards, any or all of which may be embraced by an ISO standard. These specific aims include:

• mutual understanding;
• health, safety and the protection of the environment;
• interface and interchangeability;
• fitness for purpose.

Despite the improvements which have taken place in hardware design, it is clear that there is still much that ergonomics standards could achieve in the above areas.

1.2. User performance standards

A major criticism of early standards in this field is that they were based on product design features such as the height of characters on the screen. Such standards are very specific to cathode ray tube (CRT) technology and do not readily apply to other display technologies. They may therefore inhibit innovation and force designers to stick to old solutions.

More importantly, the standards specify values for a range of different parameters quite independently and take little account of the immense interactions which take place in real use, for example between display characteristics and the environment. The standards can also be criticised for being more precise than is reasonable in areas where research is still continuing.

An alternative approach has been proposed by a sub-committee (SC4) of the Ergonomics Technical Committee (TC159) of the International Organisation for Standardisation (ISO). Here, the emphasis is placed on user performance standards. Thus, rather than simply specify a product feature such as character height which experts believe will result in a legible display, ISO are developing procedures for testing directly such characteristics as legibility. The standard is then stated in terms of the user performance required from the equipment and not in terms


of how that is achieved. The performance measure includes speed and accuracy and the avoidance of discomfort.

Such user performance standards have a number of advantages. They are:

• relevant to the real problems experienced by users;
• tolerant of developments in the technology;
• flexible enough to cope with interactions between factors.

However, they also suffer from a number of disadvantages. They cannot be totally complete and scientifically valid in all cases. They represent reasonable compromises, and obtaining the agreement of all the parties in standards setting takes time.

1.3. The test method

We have critically evaluated the original ISO procedure in ISO 9241-3 [1]. Based partly on our work, the ISO working group have now recommended the adoption of a new test procedure [2,3] that will use statistical methods developed in our earlier study. The aim of this paper is to critically review these procedures and to address in detail the reliability and validity of Barnard's test.

Psychologists are familiar with the use of non-sequential statistical procedures to accept or reject an experimental hypothesis, but sequential procedures are less well known. In fact, this approach offers value in behavioural testing, since sequential analysis frequently requires fewer subjects to accept or reject an experimental hypothesis [4].

1.4. General theory

Statistical decisions are prone to two kinds of error. The first type of error (Type I) occurs when the null hypothesis is falsely rejected; the second type of error (Type II) occurs when the null hypothesis is falsely accepted. These two risks are usually symbolised by α and β (see Table 1).

In non-sequential testing the sample size in an experiment should be fixed in advance¹ by using the following formula (adapted from [5], p. 330):

$$N = \frac{2(m_\alpha + m_\beta)^2}{\Delta^2},$$

where $m_\alpha$ and $m_\beta$ are the normal deviates (z scores) corresponding to α and β respectively, and Δ is expressed in standard deviation units. So, for example, if α and β are both set to 0.05 and we wish to detect a difference between the means of half a standard deviation, then:

$$N = \frac{2(1.65 + 1.65)^2}{0.5^2} \approx 87$$

and hence at least 87 subjects should be tested.
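This calculation is easy to script. The following is a minimal Python sketch; the function name and the use of scipy.stats.norm.ppf to obtain the one-tailed normal deviates are our own choices, not part of the original text:

```python
import math
from scipy.stats import norm

def fixed_sample_size(alpha: float, beta: float, delta: float) -> int:
    """Fixed sample size N = 2(m_alpha + m_beta)^2 / delta^2 (adapted from [5]).

    alpha, beta -- Type I and Type II error risks
    delta       -- difference to detect, in standard deviation units
    """
    m_alpha = norm.ppf(1 - alpha)  # one-tailed normal deviate for alpha
    m_beta = norm.ppf(1 - beta)    # one-tailed normal deviate for beta
    return math.ceil(2 * (m_alpha + m_beta) ** 2 / delta ** 2)

print(fixed_sample_size(0.05, 0.05, 0.5))  # 87, as in the worked example
```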

Sequential analysis was devised independently by Wald [6–8] in the United States and by Barnard [9] in England during World War II. The main feature of sequential analysis is that the sample size is not determined in advance; instead, the validity of the null hypothesis is tested after each set of results has been collected. This has obvious value in behavioural science and in conformance testing, since testing subjects is expensive.

1.5. Types of sequential test

The first major report on sequential analysis was prepared by the Statistical Research Group at Columbia University [8] and describes five sequential procedures, all based on the work of Wald:

1. Sequential analysis when the result of a single observation is a classification as good or bad and when the result of the test is acceptance or rejection (the Binomial test).

2. Sequential analysis when the result of a single observation is a classification as good or bad and when the result of the test is a decision between two methods or products.

3. Sequential analysis when the quality being tested is measured and when the question is whether a standard is exceeded or fallen short of (the Normal Curve test).

4. Sequential analysis when the quality being tested is measured and when the question is whether a lot differs from a standard.

5. Sequential analysis of variability of quality about the average.

This paper considers only the first and third items from this list, since the others are not applicable to the performance test methods in ISO 9241-3. In addition, Barnard's test is described and analysed in some detail.

Table 1
The types of decision that can be made using a statistical test

                                                Decision after testing
Truth                                           Test display accepted    Test display rejected
Test display at least as good as
  Benchmark display                             Correct decision         Manufacturer's risk (α)
Test display worse than Benchmark
  display                                       User's risk (β)          Correct decision

¹ In practice, few scientists follow this procedure.


1.6. Assumptions

Barnard's test and the Normal Curve test are parametric: they require normally distributed data. The Binomial test assumes a binomial distribution. In addition, the Normal Curve test assumes that the standard deviations of the Standard and the Test sample are the same and that this value is known in advance.

2. Numerical recipes

This section provides recipes for performing Barnard's test, the Normal Curve test, and the Binomial test.² For more detailed analyses, see Ref. [8]. For all tests, it is assumed that the experimenter wishes to test conformance of a Test display to a standard using behavioural methods. For Barnard's test and for the Normal Curve test, the Test display is compared with a Benchmark display (which just meets the standard) and the two are compared on some measure (in the examples given here the dependent measure is speed of use, measured in seconds). Since statistical tests cannot be used to prove identity, they are used here to decide if performance for the Test display is significantly worse than performance for the Benchmark display. If not, the Test display is considered to conform to the standard.

Hence, the null hypothesis, H0, is that there is no difference between the speed of use of the Test and Benchmark displays. The alternative hypothesis, H1, is that the speed of use for the Test display is significantly slower than for the Benchmark display.

2.1. Barnard’s U

This test is used to compare scores (such as speed of use) for a Test and a Benchmark product (see Table 2).

Consider the following worked example, in which x1 and x0 denote scores for speed of use (in seconds) for a Test display and a Benchmark display respectively (see Table 3).

It can be seen that after 12 subjects, U > U1, and so the alternative hypothesis is accepted (that is, the Test display fails): speed of use of the Test display is significantly slower than for the Benchmark display.

2.2. Normal curve

This test is used under exactly the same conditions as Barnard's test. The main difference is that this test requires the standard deviation of the Test and Benchmark scores to be the same, and this value must be known in advance (see Table 4).

As an example, consider the case where α = 0.05 and β = 0.05 (see Table 5). The standard deviation of response times to the Test and Benchmark displays is known to be σ = 12.0 seconds and it is considered important to detect a difference of half of one standard deviation, that is δ = 6.0 seconds. Then we have:

$$a = \ln\left(\frac{1-\beta}{\alpha}\right) = 2.944, \qquad b = \ln\left(\frac{1-\alpha}{\beta}\right) = 2.944,$$

Table 2
Numerical recipe to carry out Barnard's test

1 (i) Note α, the risk of asserting a significant difference when the displays are the same, and β, the risk of asserting no significant difference when the displays are in fact different: α, β
  (ii) Note Δ, the difference between the means that it is important to detect: Δ
2 For each subject, obtain a score for the Benchmark display (x0) and for the Test display (x1): x0, x1
3 Compute the difference score: x1 − x0
4 Compute T, the sum of the difference scores for all subjects tested: T = Σ(x1 − x0)
5 Compute S, the sum of the squared differences: S = Σ(x1 − x0)²
6 Compute the U statistic: U = T/√S
7 Compare this statistic with boundary values, U0 and U1, according to the appropriate values of α, β and Δ (see Appendix B). If U < U0 then the null hypothesis is accepted, and the Test display passes. If U > U1 then the alternative hypothesis is accepted, and the Test display fails. If U0 < U < U1, no decision can be made and testing must continue.

² Spreadsheets (in Microsoft Excel for Windows and Excel for the Macintosh) and a program that runs under DOS have been written to carry out these analyses; these can be obtained by contacting System Concepts Ltd.
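For illustration, steps 3 to 6 of Table 2 reduce to a few lines of Python. This is a minimal sketch (the function name is ours; the boundary values U0 and U1 must still be looked up for the chosen α, β and Δ, as in Appendix B):

```python
import math

def barnard_u(diffs):
    """Barnard's U statistic for the difference scores collected so far.

    diffs -- list of x1 - x0 values, one per subject tested (Table 2, step 3)
    """
    t = sum(diffs)                 # T, the sum of the difference scores (step 4)
    s = sum(d * d for d in diffs)  # S, the sum of the squared differences (step 5)
    return t / math.sqrt(s)        # U = T / sqrt(S) (step 6)

# The twelve difference scores from Table 3:
diffs = [1.866, 2.703, -1.078, 1.88, 1.39, 8.122,
         -0.123, 1.135, 1.93, 2.445, 3.46, 5.24]
print(round(barnard_u(diffs), 3))  # 2.504, exceeding U1 = 2.412, so the display fails
```

In practice U would be recomputed after each subject and compared with the boundaries for that sample size, exactly as the columns of Table 3 do.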


$$h_0 = -\frac{b\sigma^2}{\delta} = -70.665, \qquad h_1 = \frac{a\sigma^2}{\delta} = 70.665.$$

After 10 subjects, T < T0 and the null hypothesis is accepted (that is, the Test display passes): there is no significant difference in the speed of use of the Test and Benchmark displays.

2.3. Binomial test

This test is used to accept or reject a test product when the result of a single observation is pass or fail (see Table 6). This test is already used to evaluate child-resistant containers in DIN 55 559 [10] and BS 6652 [11].

For example, consider the case where a Test display is measured for its susceptibility to flicker. α, the risk of rejecting a good display, is set at 0.1. β, the risk of accepting a bad display, is thought to be more serious and is set at 0.05.

Table 3
Worked example of Barnard's test; values in parentheses are to assist in the drawing of boundaries but cannot be used to make a pass or fail decision (see [12])

Subject  x1     x0     x1−x0    T      S     U      U0        U1
1        9.78   7.92   1.866    1.866  3.48  1.000
2        17.19  14.48  2.703    4.568  10.8  1.391            (6.96)
3        38.32  39.39  −1.078   3.491  11.9  1.010  (−5.045)
4        16.08  14.20  1.88     5.371  15.5  1.365  (−3.130)  (3.010)
5        13.56  12.17  1.39     6.761  17.4  1.620  (−2.600)  (2.870)
6        19.57  11.45  8.122    14.88  83.4  1.630  −2.070    (2.730)
7        6.26   6.38   −0.123   14.76  83.4  1.616  −1.790    (2.645)
8        8.20   7.06   1.135    15.89  84.7  1.727  −1.510    2.560
9        24.16  22.23  1.93     17.82  88.4  1.896  −1.330    2.510
10       10.35  7.90   2.445    20.27  94.4  2.086  −1.150    2.460
11       13.83  10.37  3.46     23.73  106   2.301  −1.034    2.436
12       12.21  6.97   5.24     28.97  134   2.504  −0.918    2.412

Table 4
Numerical recipe to carry out the Normal Curve test

1 (i) Note α, the risk of asserting a significant difference when the displays are the same, and β, the risk of asserting no significant difference when the displays are in fact different: α, β
  (ii) Note σ, the standard deviation (assumed to be the same for the Test and Benchmark displays): σ
  (iii) Note δ, the difference between the means that it is important to detect: δ
2 Compute a: a = ln((1 − β)/α)
3 Compute b: b = ln((1 − α)/β)
4 Compute h0: h0 = −bσ²/δ
5 Compute h1: h1 = aσ²/δ
6 Compute the limits T0 and T1 for the current sample size n: T0 = h0 + nδ/2, T1 = h1 + nδ/2
7 For each subject, obtain a score for the Benchmark display (x0) and for the Test display (x1): x0, x1
8 Compute the difference score: x1 − x0
9 Compute T, the sum of the difference scores for all subjects tested: T = Σ(x1 − x0)
10 If T < T0 then the null hypothesis is accepted, and the Test display passes. If T > T1 then the alternative hypothesis is accepted, and the Test display fails. If T0 < T < T1, no decision can be made and testing must continue.
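A short Python sketch of the boundary computation in Table 4 (the function name is ours; natural logarithms are assumed, as in the worked example):

```python
import math

def normal_curve_limits(alpha, beta, sigma, delta, n):
    """Decision limits T0, T1 for the Normal Curve test after n subjects (Table 4).

    alpha, beta -- Type I and Type II error risks
    sigma       -- known common standard deviation of Test and Benchmark scores
    delta       -- difference between the means that it is important to detect
    """
    a = math.log((1 - beta) / alpha)   # step 2
    b = math.log((1 - alpha) / beta)   # step 3
    h0 = -b * sigma ** 2 / delta       # step 4
    h1 = a * sigma ** 2 / delta        # step 5
    return h0 + n * delta / 2, h1 + n * delta / 2  # step 6

# Worked example: alpha = beta = 0.05, sigma = 12.0, delta = 6.0
print(normal_curve_limits(0.05, 0.05, 12.0, 6.0, 1))   # approx (-67.67, 73.67), as in Table 5
print(normal_curve_limits(0.05, 0.05, 12.0, 6.0, 10))  # approx (-40.67, 100.67)
```

The cumulative difference score T is then compared with these limits after each subject, as in step 10.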


The unacceptable proportion of subjects reporting flicker is set at 10% (i.e. p1 = 0.1) and the acceptable level is set at 8% (i.e. p0 = 0.08) (see Table 7). Then:

$$h_0 = \frac{\log\left(\dfrac{1-0.1}{0.05}\right)}{\log\left(\dfrac{0.10}{0.08} \times \dfrac{1-0.08}{1-0.1}\right)} = 11.7915,$$

$$h_1 = \frac{\log\left(\dfrac{1-0.05}{0.1}\right)}{\log\left(\dfrac{0.10}{0.08} \times \dfrac{1-0.08}{1-0.1}\right)} = 9.1844,$$

$$s = \frac{\log\left(\dfrac{1-0.08}{1-0.1}\right)}{\log\left(\dfrac{0.10}{0.08} \times \dfrac{1-0.08}{1-0.1}\right)} = 0.0897.$$

After 15 subjects, d = d1 and the alternative hypothesis is accepted; that is, the Test display flickers.

3. How reliable and valid is Barnard’s test?

This section describes a detailed analysis of Barnard's test using Monte Carlo simulations.

Table 5
Worked example of the Normal Curve test

Subject  x1     x0     x1−x0     T = Σ(x1−x0)  T0      T1
1        60.01  61.81  −1.799    −1.8          −67.67  73.67
2        31.19  32.38  −1.184    −2.98         −64.67  76.67
3        34.21  37.94  −3.732    −6.72         −61.67  79.67
4        21.95  28.12  −6.165    −12.9         −58.67  82.67
5        43.12  44.34  −1.218    −14.1         −55.67  85.67
6        27.11  29.99  −2.884    −17           −52.67  88.67
7        36.53  47.3   −10.764   −27.7         −49.67  91.67
8        26.12  31.06  −4.936    −32.7         −46.67  94.67
9        35.31  39.7   −4.392    −37.1         −43.67  97.67
10       29.04  34.69  −5.645    −42.7         −40.67  100.67

Table 6
Numerical recipe to carry out the Binomial test

1 (i) Note α, the risk of asserting a significant difference when the displays are the same, and β, the risk of asserting no significant difference when the displays are in fact different: α, β
  (ii) Note p0, the acceptable quality limit defined as fraction defective: p0
  (iii) Note p1, the unacceptable quality limit defined as fraction defective: p1
2 Compute h0: h0 = log((1 − α)/β) / log((p1/p0) × ((1 − p0)/(1 − p1)))
3 Compute h1: h1 = log((1 − β)/α) / log((p1/p0) × ((1 − p0)/(1 − p1)))
4 Compute s: s = log((1 − p0)/(1 − p1)) / log((p1/p0) × ((1 − p0)/(1 − p1)))
5 Compute the limits d0 and d1 for the current sample size n: d0 = −h0 + sn (if d0 is not an integer, round it down to the nearest integer); d1 = h1 + sn (if d1 is not an integer, round it up to the nearest integer)
6 Compute d, the number of defective items: d = Σ defects
7 If d ≤ d0 then the null hypothesis is accepted, and the Test display passes. If d ≥ d1 then the alternative hypothesis is accepted, and the Test display fails. If d0 < d < d1, no decision can be made and testing must continue.
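The recipe above is equally short in code. A minimal Python sketch (function name ours; natural logarithms are used, which is safe because the base cancels in each ratio):

```python
import math

def binomial_limits(alpha, beta, p0, p1, n):
    """Acceptance and rejection numbers d0, d1 for the Binomial test (Table 6).

    p0 -- acceptable quality limit (fraction defective)
    p1 -- unacceptable quality limit (fraction defective)
    """
    denom = math.log((p1 / p0) * ((1 - p0) / (1 - p1)))
    h0 = math.log((1 - alpha) / beta) / denom   # step 2
    h1 = math.log((1 - beta) / alpha) / denom   # step 3
    s = math.log((1 - p0) / (1 - p1)) / denom   # step 4
    d0 = math.floor(-h0 + s * n)  # accept if d <= d0 (step 5)
    d1 = math.ceil(h1 + s * n)    # reject if d >= d1 (step 5)
    return d0, d1

# Worked example parameters: alpha = 0.1, beta = 0.05, p0 = 0.08, p1 = 0.1
print(binomial_limits(0.1, 0.05, 0.08, 0.1, 15))
# (-11, 11): d1 = 11 matches Table 7; the negative d0 means acceptance
# is not yet possible at this sample size.
```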


The aim of this section is to judge the robustness of Barnard's test using simulated data for a Test display; these data have widely varying standard deviations and mean values.

3.1. Simulation of data

To evaluate Barnard's test, a simulated array of Benchmark display data was produced. These data had x̄ = 0 and σ = 1. They were then tested against a simulated array of Test display data. The mean of the Test display data varied between x̄ = −0.60 and x̄ = +1.50 in steps of 0.05 (hence there were 42 different values of x̄); for each of these mean values, the standard deviation was set at σ = 0.2, 0.4, 0.6, 0.8, 0.9, 1.0, 1.1, 1.2, 1.4, 1.6, 1.8 and 2.0³ (hence there were 12 values of σ). In total, there were 42 × 12 = 504 distributions of Test display data. The total size of both the Test and Benchmark arrays was set at 87, since for α = β = 0.05 a conventional t-test should be used for sample sizes in excess of this (see above).

For each x̄ and σ, 100 sets of Benchmark and Test data were compared using Barnard's test. The following statistics were derived from this comparison: (a) the percentage of passes for the simulated Test display; (b) the percentage of 'no decisions' after testing 87 simulated subjects; and (c) the average sample size required to make a decision (either pass or fail).

The data were generated in the following way. For the required distribution, a mean value, x̄, and a standard deviation, σ, were defined. Next, the data were quantized into 15 bins ranging between ±3.75 standard deviations of the mean value, x̄. For example, for the Benchmark display, where x̄ = 0 and σ = 1, the mean value of each bin ranged from x = x̄ − 3.75σ = −3.75 to x = x̄ + 3.75σ = 3.75. The range of each bin was defined as the range of the distribution divided by the number of bins; for this example, (3.75 × 2)/15 = 0.5. The mean of each successive bin was then simply the smallest value (−3.75 in this example) plus the range (0.5 in this example). The number of observations in each bin, n, was then computed from the expression for the normal curve, that is:

$$n = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x - \bar{x}}{\sigma}\right)^2\right),$$

and n random numbers were generated within each bin between the lower and upper limits of that bin. Finally, the array of numbers produced was itself randomised to simulate a random sample of subjects arriving at the laboratory.
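The following Python sketch is our reading of this generation procedure (the function name and the scaling of the bin counts to a fixed array size are assumptions; the paper's expression gives relative rather than absolute counts):

```python
import numpy as np

def simulate_scores(mean, sd, n_bins=15, half_width=3.75, total=87, rng=None):
    """Generate binned-normal scores as described in Section 3.1 (a sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    edges = np.linspace(mean - half_width * sd, mean + half_width * sd, n_bins + 1)
    centres = (edges[:-1] + edges[1:]) / 2
    # Relative bin counts from the normal curve, scaled to the array size
    weights = np.exp(-0.5 * ((centres - mean) / sd) ** 2)
    counts = np.round(weights / weights.sum() * total).astype(int)
    # Draw uniformly within each bin, then shuffle to simulate subjects
    # arriving at the laboratory in random order
    scores = np.concatenate([rng.uniform(lo, hi, size=c)
                             for lo, hi, c in zip(edges[:-1], edges[1:], counts)])
    rng.shuffle(scores)
    return scores

benchmark = simulate_scores(0.0, 1.0)  # the x̄ = 0, σ = 1 Benchmark array
```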

3.2. Results

3.2.1. Type I and Type II errors

Table 1 shows the types of decision that can be made using a statistical test. There are two types of correct decision: the display can be correctly accepted or correctly rejected. But there are also two types of error: the display can be falsely rejected or falsely accepted. We have captured this information in Fig. 1.

For each set of simulations, the percentages of Type I (α) and Type II (β) errors were computed. For example, when the simulated array of Test display data had a mean of 0 or less, and the decision from Barnard's test was to (falsely) reject the display, this was a Type I error. Similarly, when the simulated array of Test display data had a mean greater than 0 and the decision from Barnard's test was to (falsely) accept the display, this was a Type II error.

Note that errors in the 0 to 0.5 range of test sample means are acceptable, since the test was asked to distinguish only those means that differed by Δ = 0.5 or more. Type I and Type II errors of less than 5% can be ignored, since these were the acceptable parameters of α and β.

Table 7
Worked example of the Binomial test

Subject  Flicker reported?  d = Σ defects  d0  d1
1        No                 0              0   10
2        No                 0              0   10
3        Yes                1              0   10
4        Yes                2              0   10
5        Yes                3              0   10
6        No                 3              0   10
7        Yes                4              0   10
8        Yes                5              0   10
9        Yes                6              0   10
10       Yes                7              0   11
11       Yes                8              0   11
12       Yes                9              0   11
13       No                 9              0   11
14       Yes                10             0   11
15       Yes                11             0   11
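As a cross-check, the binomial_limits sketch given after Table 6 reproduces this decision when fed the Table 7 responses:

```python
responses = [False, False, True, True, True, False, True, True,
             True, True, True, True, False, True, True]  # Table 7; True = flicker
d = 0
for n, flicker in enumerate(responses, start=1):
    d += flicker
    d0, d1 = binomial_limits(0.1, 0.05, 0.08, 0.1, n)
    if d >= d1:
        print(f"Reject after {n} subjects (d = {d}, d1 = {d1})")
        break
# Prints: Reject after 15 subjects (d = 11, d1 = 11)
```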

³ When expressed as a proportion of the standard deviation of the Benchmark display, the standard deviations of the Test displays in our evaluations of an earlier version of the test procedure ranged from 1.05 to 1.29.


It can be seen that in our simulations, the number of Type I errors was kept within acceptable bounds. From the manufacturer's point of view, therefore, the statistical test is a good one, since it did not falsely reject any 'good' displays.

The extent of Type II errors depends on the standard deviation of the test sample. When the standard deviation of the Test sample is the same as that of the Benchmark sample, Type II errors are within acceptable bounds (i.e. less than 5%). However, as the standard deviation of the Test sample increases, the number of Type II errors increases. This is an issue only when the mean of the Test sample is close to that of the Benchmark sample, that is, when the Test display only just fails. Note that the data we have so far obtained suggest that the actual range of standard deviations is likely to be considerably smaller than that shown theoretically in Fig. 1; our measured range of standard deviations using an earlier version of the test method is between 1.05 and 1.29.

3.2.2. Percentage of 'no decisions'

On some occasions, it was not possible to come to a decision after testing all of the sample of 87. This was most likely to happen when the mean of the Test sample was only slightly greater (usually within half a standard deviation) than the mean of the Benchmark sample. Depending on the standard deviation of the Test sample, the worst cases vary between about 25% and about 40%.

The main condition under which Barnard's test appears least efficient with subjects is when the mean value of the Test display is only marginally slower (once again, about half a standard deviation) than the mean value of the Benchmark display. This can be explained by the fact that Δ, the difference between the means that it was considered important to detect, was set at 0.5.

3.3. How many subjects are needed?

Finally, it is useful to consider the average sample sizes that are required to come to a decision when using Barnard's test. Since the data are qualitatively the same for all standard deviations, only the graph for a standard deviation of 1.0 is shown in Fig. 2.

Note that when the mean of the Test sample is the same as or quicker than the mean of the Benchmark sample, a decision can be made with about 15 subjects or fewer. Similarly, when the mean of the Test sample is more than one standard deviation slower than the Benchmark sample, no more than about 15 subjects are required. Even for the worst cases, the average sample size is never greater than 45 (although of course these are averages, and so individual tests will sometimes require more or fewer subjects). Other data (not presented here) show that the sample size does not vary dramatically with variations in the standard deviation of the Test sample.

Fig. 1. Distribution of Type I and Type II errors in a space defined by the mean and standard deviation of the test sample. The degree of shading (see key at top of figure) shows the probability of making a Type I or Type II error.

Fig. 2. Average sample numbers required for the condition where the standard deviation of the Test sample was the same as the standard deviation of the Benchmark sample. There are no qualitative variations in the structure of these data with variations in the standard deviation of the test sample. Samples with which it was not possible to come to a decision after testing 87 subjects are excluded from these data.

4. Conclusion

1. Barnard's test provides a rapid and reliable method to measure the performance of a test display against a benchmark.

2. As the test method in ISO 9241-3 develops further, the Normal Curve test could be used to assess a test display against an absolute benchmark. Data are needed to provide an accurate estimate of the standard deviation for the task.

3. The Binomial test could be applied to the flicker method in ISO 9241-3, although some discussion is needed on the appropriate values for the parameters α, β, p1 and p0.

4. Barnard's test is quite robust to changes in the standard deviation of the test sample. Predictably, the largest number of subjects is required when the mean of the Test sample is about Δ standard deviations from the Benchmark sample (where Δ is the difference between the means that is considered important). But even in this worst case, the test does not require more than about 45 subjects to reach a decision.

Acknowledgements

As with all developing and emerging standards, this statistical procedure has benefited from the comments of numerous people. We would particularly like to thank all past and current members of ISO TC159 SC4 WG2. The work described in this report was supported by grant number 3393/R55.052 from the United Kingdom Health and Safety Executive.

References

[1] D.S. Travis, T.F.M. Stewart, C. Mackay, Evaluating image quality, Displays 13 (1992) 139–146.

[2] J.A.J. Roufs, M.C. Boschman, Text quality metrics for visual display units: I. Methodological aspects, Displays (1997), this issue.

[3] M.C. Boschman, J.A.J. Roufs, Text quality metrics for visual display units: II. An experimental survey, Displays (1997), this issue.

[4] F.R. Brigham, Statistical methods for testing the conformance of products to user performance standards, Behaviour and Information Technology 8 (1989) 279–283.

[5] W.L. Hays, Statistics, Holt, Rinehart and Winston, New York, 1963.
[6] A. Wald, Sequential Analysis, John Wiley, New York, 1947.
[7] A. Wald, J. Wolfowitz, Sampling inspection plans for continuous production which ensure a prescribed limit on the outgoing quality, Annals of Mathematical Statistics 16 (1945) 30–49.

[8] Statistical Research Group, Columbia University, Sequential Analysis of Statistical Data: Applications, Columbia University Press, New York, 1945.

[9] G.A. Barnard, Sequential tests in industrial statistics, J. Roy. Stat. Soc. 8 (1946) 1–21.

[10] R. Gadeke, W. De Felice, Childproof packaging, Drugs Made in Germany 21 (1978) 3–17.

[11] BS 6652: 1989, Packages resistant to opening by children, British Standards Institution, London, 1989.

[12] Davies, The Design and Analysis of Industrial Experiments, Oliver and Boyd, London, 1954.
