The State of the Art in Automating Usability Evaluation of User Interfaces

MELODY Y. IVORY AND MARTI A. HEARST

University of California, Berkeley

Usability evaluation is an increasingly important part of the user interface design process. However, usability evaluation can be expensive in terms of time and human resources, and automation is therefore a promising way to augment existing approaches. This article presents an extensive survey of usability evaluation methods, organized according to a new taxonomy that emphasizes the role of automation. The survey analyzes existing techniques, identifies which aspects of usability evaluation automation are likely to be of use in future research, and suggests new ways to expand existing approaches to better support usability evaluation.

Categories and Subject Descriptors: H.1.2 [Information Systems]: User/Machine Systems—human factors; human information processing; H.5.2 [Information Systems]: User Interfaces—benchmarking; evaluation/methodology; graphical user interfaces (GUI)

General Terms: Human Factors

Additional Key Words and Phrases: Graphical user interfaces, taxonomy, usability evaluation automation, web interfaces

1. INTRODUCTION

Usability is the extent to which a computer system enables users, in a given context of use, to achieve specified goals effectively and efficiently while promoting feelings of satisfaction.1 Usability evaluation (UE) consists of methodologies for measuring the usability aspects of a system's user interface (UI) and identifying specific problems [Dix et al. 1998; Nielsen 1993].

1 Adapted from ISO9241 [International Standards Organization 1999].

This research was sponsored in part by the Lucent Technologies Cooperative Research Fellowship Program, a GAANN fellowship, and Kaiser Permanente.

Authors' addresses: M. Y. Ivory, Computer Science Division, University of California, Berkeley, Berkeley, CA 94720-1776; email: [email protected]; M. A. Hearst, School of Information Management and Systems, University of California, Berkeley, Berkeley, CA 94720-4600; email: [email protected].

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 2001 ACM 0360-0300/01/1200–0470 $5.00

Usability evaluation is an important part of the overall user interface design process, which consists of iterative cycles of designing, prototyping, and evaluating [Dix et al. 1998; Nielsen 1993]. Usability evaluation is itself a process that entails many activities depending on the method employed. Common activities include:

—Capture: collecting usability data, such as task completion time, errors, guideline violations, and subjective ratings;

—Analysis: interpreting usability data to identify usability problems in the interface; and

—Critique: suggesting solutions or improvements to mitigate problems.

A wide range of usability evaluation techniques have been proposed, and a subset of these is currently in common use. Some evaluation techniques, such as formal user testing, can only be applied after the interface design or prototype has been implemented. Others, such as heuristic evaluation, can be applied in the early stages of design. Each technique has its own requirements, and generally different techniques uncover different usability problems.

Usability findings can vary widely when different evaluators study the same user interface, even if they use the same evaluation technique [Jeffries et al. 1991; Molich et al. 1998, 1999; Nielsen 1993]. Two studies in particular, the first and second comparative user testing studies (CUE-1 [Molich et al. 1998] and CUE-2 [Molich et al. 1999]), demonstrated less than a 1% overlap in findings among four and eight independent usability testing teams for evaluations of two user interfaces. This result implies a lack of systematicity or predictability in the findings of usability evaluations. Furthermore, usability evaluation typically only covers a subset of the possible actions users might take. For these reasons, usability experts often recommend using several different evaluation techniques [Dix et al. 1998; Nielsen 1993].

How can systematicity of results and fuller coverage in usability assessment be achieved? One solution is to increase the number of usability teams evaluating the system and to increase the number of study participants. An alternative is to automate some aspects of usability evaluation, such as the capture, analysis, or critique activities.

Automation of usability evaluation has several potential advantages over nonautomated evaluation, such as the following.

—Reducing the cost of usability evaluation. Methods that automate capture, analysis, or critique activities can decrease the time spent on usability evaluation and consequently the cost. For example, software tools that automatically log events during usability testing eliminate the need for manual logging, which can typically take up a substantial portion of evaluation time.

—Increasing consistency of the errors uncovered. In some cases it is possible to develop models of task completion within an interface, and software tools can consistently detect deviations from these models. It is also possible to detect usage patterns that suggest possible errors, such as immediate task cancellation.

—Predicting time and error costs across an entire design. As previously discussed, it is not always possible to assess every single aspect of an interface using nonautomated evaluation. Software tools, such as analytical models, make it possible to widen the coverage of evaluated features.

—Reducing the need for evaluation expertise among individual evaluators. Automating some aspects of evaluation, such as the analysis or critique activities, could aid designers who do not have expertise in those aspects of evaluation.

—Increasing the coverage of evaluated features. Due to time, cost, and resource constraints, it is not always possible to assess every single aspect of an interface. Software tools that generate plausible usage traces make it possible to evaluate aspects of interfaces that may not otherwise be assessed.

—Enabling comparisons between alternative designs. Because of time, cost, and resource constraints, usability evaluations typically assess only one design or a small subset of features from multiple designs. Some automated analysis approaches, such as analytical modeling and simulation, enable designers to compare predicted performance for alternative designs.

—Incorporating evaluation within the design phase of UI development, as opposed to being applied after implementation. This is important because evaluation with most nonautomated methods can typically be done only after the interface or prototype has been built and changes are more costly [Nielsen 1993]. Modeling and simulation tools make it possible to explore UI designs earlier.

It is important to note that we consider automation to be a useful complement and addition to standard evaluation techniques such as heuristic evaluation and usability testing—not a substitute. Different techniques uncover different kinds of problems, and subjective measures such as user satisfaction are unlikely to be predictable by automated methods.

Despite the potential advantages, the space of usability evaluation automation is quite underexplored. In this article, we discuss the state of the art in usability evaluation automation, and highlight the approaches that merit further investigation. Section 2 presents a taxonomy for classifying UE automation, and Section 3 summarizes the application of this taxonomy to 132 usability methods. Sections 4 through 8 describe these methods in more detail, including our summative assessments of automation techniques. The results of this survey suggest promising ways to expand existing approaches to better support usability evaluation.

2. TAXONOMY OF USABILITY EVALUATION AUTOMATION

In this discussion, we make a distinction between WIMP (windows, icons, pointer, and mouse) interfaces and Web interfaces, in part because the nature of these interfaces differs and in part because the usability methods discussed have often only been applied to one type or the other in the literature. WIMP interfaces tend to be more functionally oriented than Web interfaces. In WIMP interfaces, users complete tasks, such as opening or saving a file, by following specific sequences of operations. Although there are some functional Web applications, most Web interfaces offer limited functionality (i.e., selecting links or completing forms), but the primary role of many Web sites is to provide information. Of course, the two types of interfaces share many characteristics; we highlight their differences when relevant to usability evaluation.

Several surveys of UE methods for WIMP interfaces exist; Hom [1998] and Human Factors Engineering [1999] provide a detailed discussion of inspection, inquiry, and testing methods (these terms are defined below). Several taxonomies of UE methods have also been proposed. The most commonly used taxonomy is one that distinguishes between predictive (e.g., GOMS analysis and cognitive walkthrough, also defined below) and experimental (e.g., usability testing) techniques [Coutaz 1995]. Whitefield et al. [1991] present another classification scheme based on the presence or absence of a user and a computer. Neither of these taxonomies reflects the automated aspects of UE methods.

The sole existing survey of usability evaluation automation, by Balbo [1995], uses a taxonomy that distinguishes among four approaches to automation:

—Nonautomatic: methods "performed by human factors specialists";

—Automatic Capture: methods that "rely on software facilities to record relevant information about the user and the system, such as visual data, speech acts, keyboard and mouse actions";

—Automatic Analysis: methods that are "able to identify usability problems automatically"; and

—Automatic Critic: methods that "not only point out difficulties but propose improvements."

Balbo uses these categories to classify 13 common and uncommon UE methods. However, most of the methods surveyed require extensive human effort, because they rely on formal usability testing and/or require extensive evaluator interaction. For example, Balbo classifies several techniques for processing log files as automatic analysis methods despite the fact that these approaches require formal testing or informal use to generate those log files. What Balbo calls an automatic critic method may require the evaluator to create a complex UI model as input. Thus this classification scheme is somewhat misleading since it ignores the nonautomated requirements of the UE methods.

2.1. Proposed Taxonomy

To facilitate our discussion of the state of automation in usability evaluation, we have grouped UE methods along the following four dimensions.

—Method Class: describes the type of evaluation conducted at a high level (e.g., usability testing or simulation);

—Method Type: describes how the evaluation is conducted within a method class, such as thinking-aloud protocol (usability testing class) or information processor modeling (simulation class);

—Automation Type: describes the evaluation aspect that is automated (e.g., capture, analysis, or critique); and

—Effort Level: describes the type of effort required to execute the method (e.g., model development or interface usage).

2.1.1. Method Class. We classify UE methods into five method classes as follows.

—Testing: an evaluator observes users interacting with an interface (i.e., completing tasks) to determine usability problems.

—Inspection: an evaluator uses a set of criteria or heuristics to identify potential usability problems in an interface.

—Inquiry: users provide feedback on an interface via interviews, surveys, and the like.

—Analytical Modeling: an evaluator employs user and interface models to generate usability predictions.

—Simulation: an evaluator employs user and interface models to mimic a user interacting with an interface and report the results of this interaction (e.g., simulated activities, errors, and other quantitative measures).

UE methods in the testing, inspection, and inquiry classes are appropriate for formative (i.e., identifying specific usability problems) and summative (i.e., obtaining general assessments of usability) purposes. Analytical modeling and simulation are engineering approaches to UE that enable evaluators to predict usability with user and interface models. Software engineering practices have had a major influence on the first three classes, whereas the latter two, analytical modeling and simulation, are quite similar to performance evaluation techniques used to analyze the performance of computer systems [Ivory 2001; Jain 1991].

2.1.2. Method Type. There is a wide range of evaluation methods within the testing, inspection, inquiry, analytical modeling, and simulation classes. Rather than discuss each method individually, we group related methods into method types; this type typically describes how evaluation is performed. We present method types in Sections 4 through 8.

2.1.3. Automation Type. We adapted Balbo's automation taxonomy (described above) to specify which aspect of a usability evaluation method is automated.

—None: no level of automation supported (i.e., evaluator performs all aspects of the evaluation method);

—Capture: software automatically records usability data (e.g., logging interface usage);

—Analysis: software automatically identifies potential usability problems; and

—Critique: software automates analysis and suggests improvements.

2.1.4. Effort Level. We also expanded Balbo's automation taxonomy to include consideration of a method's nonautomated requirements. We augment each UE method with an attribute called effort level; this indicates the human effort required for method execution.

Fig. 1. Summary of our taxonomy for classifying usability evaluation methods. The right side of the figure demonstrates the taxonomy with two evaluation methods discussed in later sections.

—Minimal Effort: does not require interface usage or modeling.

—Model Development: requires the evaluator to develop a UI model and/or a user model in order to employ the method.

—Informal Use: requires completion of freely chosen tasks (i.e., unconstrained use by a user or evaluator).

—Formal Use: requires completion of specially selected tasks (i.e., constrained use by a user or evaluator).

These levels are not necessarily ordered by the amount of effort required, since this depends on the method employed.

2.1.5. Summary. Figure 1 provides a synopsis of our taxonomy and demonstrates it with two evaluation methods. The taxonomy consists of: a method class (testing, inspection, inquiry, analytical modeling, and simulation); a method type (e.g., log file analysis, guideline review, and surveys); an automation type (none, capture, analysis, and critique); and an effort level (minimal, model, informal, and formal). In the remainder of this article, we use this taxonomy to analyze evaluation methods.
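
For readers who want to apply the taxonomy programmatically (for example, to tabulate a set of methods), the four dimensions map naturally onto a small data structure. The sketch below is ours, not part of the survey, and the two classifications at the end are illustrative choices rather than the pair shown in Figure 1.

    from dataclasses import dataclass
    from enum import Enum

    class MethodClass(Enum):
        TESTING = "testing"
        INSPECTION = "inspection"
        INQUIRY = "inquiry"
        ANALYTICAL_MODELING = "analytical modeling"
        SIMULATION = "simulation"

    class AutomationType(Enum):
        NONE = "none"
        CAPTURE = "capture"
        ANALYSIS = "analysis"
        CRITIQUE = "critique"

    class EffortLevel(Enum):
        MINIMAL = "minimal effort"
        MODEL = "model development"
        INFORMAL = "informal use"
        FORMAL = "formal use"

    @dataclass
    class UEMethod:
        name: str                  # specific evaluation method
        method_class: MethodClass  # high-level type of evaluation
        method_type: str           # how the evaluation is conducted
        automation: AutomationType # which evaluation aspect is automated
        effort: EffortLevel        # human effort required to execute the method

    # Illustrative classifications (hypothetical picks for demonstration only).
    examples = [
        UEMethod("server log analysis", MethodClass.TESTING, "log file analysis",
                 AutomationType.ANALYSIS, EffortLevel.INFORMAL),
        UEMethod("heuristic evaluation", MethodClass.INSPECTION, "heuristic evaluation",
                 AutomationType.NONE, EffortLevel.FORMAL),
    ]
    print(examples[0])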

3. OVERVIEW OF USABILITY EVALUATION METHODS

We surveyed 75 UE methods applied to WIMP interfaces, and 57 methods applied to Web UIs. Of these 132 methods, only 29 apply to both Web and WIMP UIs. We determined the applicability of each method based on the types of interfaces a method was used to evaluate in the literature and our judgment of whether the method could be used with other types of interfaces. Table I combines survey results for both types of interfaces, showing method classes and the method types within each class. Each entry depicts specific UE methods along with the automation support available and the effort required to employ automation. For some UE methods, we discuss more than one approach; hence, we show the number of methods surveyed in parentheses beside the effort level.

Table I. Automation Support for WIMP and Web UE Methods (a)

Method Class and Method Type; Automation Types: None, Capture, Analysis, Critique

Testing
  Thinking-Aloud Protocol: F (1)
  Question-Asking Protocol: F (1)
  Shadowing Method: F (1)
  Coaching Method: F (1)
  Teaching Method: F (1)
  Codiscovery Learning: F (1)
  Performance Measurement: F (1), F (7)
  Log File Analysis: IFM (19)*
  Retrospective Testing: F (1)
  Remote Testing: IF (3)

Inspection
  Guideline Review: IF (6), (8), M (11)†
  Cognitive Walkthrough: IF (2), F (1)
  Pluralistic Walkthrough: IF (1)
  Heuristic Evaluation: IF (1)
  Perspective-Based Inspection: IF (1)
  Feature Inspection: IF (1)
  Formal Usability Inspection: F (1)
  Consistency Inspection: IF (1)
  Standards Inspection: IF (1)

Inquiry
  Contextual Inquiry: IF (1)
  Field Observation: IF (1)
  Focus Groups: IF (1)
  Interviews: IF (1)
  Surveys: IF (1)
  Questionnaires: IF (1), IF (2)
  Self-Reporting Logs: IF (1)
  Screen Snapshots: IF (1)
  User Feedback: IF (1)

Analytical Modeling
  GOMS Analysis: M (4), M (2)
  UIDE Analysis: M (2)
  Cognitive Task Analysis: M (1)
  Task-Environment Analysis: M (1)
  Knowledge Analysis: M (2)
  Design Analysis: M (2)
  Programmable User Models: M (1)

Simulation
  Information Proc. Modeling: M (9)
  Petri Net Modeling: FM (1)
  Genetic Algorithm Modeling: (1)
  Information Scent Modeling: M (1)

Automation Type totals: None 30 (67%), Capture 6 (13%), Analysis 8 (18%), Critique 1 (2%)

(a) A number in parentheses indicates the number of UE methods surveyed for a particular method type and automation type. The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M).
* Indicates that either formal or informal interface use is required. In addition, a model may be used in the analysis.
† Indicates that methods may or may not employ a model.

Some approaches provide automation support for multiple method types (see Appendix A). Table I contains 110 methods because some methods are applicable to multiple method types; we also depict methods applicable to both WIMP and Web UIs only once. Table II provides descriptions of all method types.

There are major differences in automation support among the five method classes. Overall, automation patterns are similar for WIMP and Web interfaces, with the exception that analytical modeling and simulation are far less explored in the Web domain than for WIMP interfaces (2 vs. 16 methods).

Table II. Descriptions of the WIMP and Web UE Method Types Depicted in Table I

Testing
  Thinking-Aloud Protocol: user talks during test
  Question-Asking Protocol: tester asks user questions
  Shadowing Method: expert explains user actions to tester
  Coaching Method: user can ask an expert questions
  Teaching Method: expert user teaches novice user
  Codiscovery Learning: two users collaborate
  Performance Measurement: tester records usage data during test
  Log File Analysis: tester analyzes usage data
  Retrospective Testing: tester reviews videotape with user
  Remote Testing: tester and user are not colocated during test

Inspection
  Guideline Review: expert checks guideline conformance
  Cognitive Walkthrough: expert simulates user's problem solving
  Pluralistic Walkthrough: multiple people conduct cognitive walkthrough
  Heuristic Evaluation: expert identifies violations of heuristics
  Perspective-Based Inspection: expert conducts narrowly focused heuristic evaluation
  Feature Inspection: expert evaluates product features
  Formal Usability Inspection: expert conducts formal heuristic evaluation
  Consistency Inspection: expert checks consistency across products
  Standards Inspection: expert checks for standards compliance

Inquiry
  Contextual Inquiry: interviewer questions users in their environment
  Field Observation: interviewer observes system use in user's environment
  Focus Groups: multiple users participate in a discussion session
  Interviews: one user participates in a discussion session
  Surveys: interviewer asks user specific questions
  Questionnaires: user provides answers to specific questions
  Self-Reporting Logs: user records UI operations
  Screen Snapshots: user captures UI screens
  User Feedback: user submits comments

Analytical Modeling
  GOMS Analysis: predict execution and learning time
  UIDE Analysis: conduct GOMS analysis within a UIDE
  Cognitive Task Analysis: predict usability problems
  Task-Environment Analysis: assess mapping of user's goals into UI tasks
  Knowledge Analysis: predict learnability
  Design Analysis: assess design complexity
  Programmable User Models: write program that acts like a user

Simulation
  Information Proc. Modeling: mimic user interaction
  Petri Net Modeling: mimic user interaction from usage data
  Genetic Algorithm Modeling: mimic novice user interaction
  Information Scent Modeling: mimic Web site navigation

Appendix A shows the information in Table I separated by UI type.

Table I shows that automation in general is greatly underexplored. Methods without automation support represent 67% of the methods surveyed, and methods with automation support collectively represent only 33%. Of this 33%, capture methods represent 13%, analysis methods represent 18%, and critique methods represent 2%. All but two of the capture methods require some level of interface usage; genetic algorithms and information scent modeling both employ simulation to generate usage data for subsequent analysis. Overall, only 29% of all of the methods surveyed (nonautomated and automated) do not require formal or informal interface use to employ.

To provide the fullest automation support, software would have to critique interfaces without requiring formal or informal use. Our survey found that this level of automation has been developed for only one method type: guideline review (e.g., Farenc and Palanque [1999], Lowgren and Nordvist [1992], and Scholtz and Laskowski [1998]). Guideline review methods automatically detect and report usability violations and then make suggestions for fixing them (discussed further in Section 5).

Of those methods that support the next level of automation—analysis—Table I shows that analytical modeling and simulation methods represent the majority. Most of these methods do not require formal or informal interface use.

The next sections discuss the various UE methods and their automation in more detail. Some methods are applicable to both WIMP and Web interfaces; however, we make distinctions where necessary about a method's applicability. Our discussion also includes our assessments of automated capture, analysis, and critique techniques using the criteria:

—Effectiveness: how well a method discovers usability problems,

—Ease of use: how easy a method is to employ,

—Ease of learning: how easy a method is to learn, and

—Applicability: how widely applicable a method is to WIMP and/or Web UIs other than those to which it was originally applied.

We highlight the effectiveness, ease of use, ease of learning, and applicability of automated methods in our discussion of each method class. Ivory [2001] provides a detailed discussion of all nonautomated and automated evaluation methods surveyed.

4. AUTOMATING USABILITY TESTING METHODS

Usability testing with real participants is a fundamental usability evaluation method [Nielsen 1993; Shneiderman 1998]. It provides an evaluator with direct information about how people use computers and what some of the problems are with the interface being tested. During usability testing, participants use the system or a prototype to complete a predetermined set of tasks while the tester records the results of the participants' work. The tester then uses these results to determine how well the interface supports users' task completion as well as other measures, such as number of errors and task completion time.

Automation has been used predominantly in two ways within usability testing: automated capture of use data and automated analysis of these data according to some metrics or a model (referred to as log file analysis in Table I). In rare cases methods support both automated capture and analysis of usage data [Al-Qaimari and McRostie 1999; Uehling and Wolf 1995].

4.1. Automating Usability Testing Methods: Capture Support

Many usability testing methods require the recording of the actions a user makes while exercising an interface. This can be done by an evaluator taking notes while the participant uses the system, either live or by repeatedly viewing a videotape of the session: both are time-consuming activities. As an alternative, automated capture techniques can log user activity automatically. An important distinction can be made between information that is easy to record but difficult to interpret (e.g., keystrokes) and information that is meaningful but difficult to automatically label, such as task completion. Automated capture approaches vary with respect to the granularity of information captured.

Within the usability testing class of UE, automated capture of usage data is supported by two method types: performance measurement and remote testing. Both require the instrumentation of a user interface, incorporation into a user interface management system (UIMS), or capture at the system level. A UIMS is a software library that provides high-level abstractions for specifying portable and consistent interface models that are then compiled into UI implementations [Olsen, Jr. 1992]. Table III provides a synopsis of automated capture methods discussed in the remainder of this section. We discuss support available for WIMP and Web UIs separately.

Table III. Synopsis of Automated Capture Support for Usability Testing Methods (a)

Method Class: Usability Testing
Automation Type: Capture

Method Type: Performance Measurement—record usage data during test (7 methods)
  Log low-level events (Hammontree et al. [1992]); UI: WIMP; Effort: F
  Log UIMS events (UsAGE, IDCAT); UI: WIMP; Effort: F
  Log system-level events (KALDI); UI: WIMP; Effort: F
  Log Web server requests (Scholtz and Laskowski [1998]); UI: Web; Effort: F
  Log client-side activities (WebVIP, WET); UI: Web; Effort: F

Method Type: Remote Testing—tester and user are not colocated (3 methods)
  Employ same-time different-place testing (KALDI); UI: WIMP, Web; Effort: IF
  Employ different-time different-place testing (journaled sessions); UI: WIMP, Web; Effort: IF
  Analyze a Web site's information organization (WebCAT); UI: Web; Effort: IF

(a) The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M).

4.1.1. Automating Usability Testing Methods: Capture Support—WIMP UIs. Performance measurement methods record usage data (e.g., a log of events and times when events occurred) during a usability test. Video recording and event logging tools [Al-Qaimari and McRostie 1999; Hammontree et al. 1992; Uehling and Wolf 1995] are available to automatically and accurately align timing data with user interface events. Some event logging tools (e.g., Hammontree et al. [1992]) record events at the keystroke or system level. Recording data at this level produces voluminous log files and makes it difficult to map recorded usage into high-level tasks.

As an alternative, two systems log events within a UIMS. UsAGE (user action graphing effort)2 [Uehling and Wolf 1995] enables the evaluator to replay logged events, meaning it can replicate logged events during playback. This requires that the same study data (databases, documents) be available during playback as were used during the usability test. IDCAT (integrated data capture and analysis tool) [Hammontree et al. 1992] logs events and automatically filters and classifies them into meaningful actions. This system requires a video recorder to synchronize taped footage with logged events. KALDI (keyboard/mouse action logger and display instrument) [Al-Qaimari and McRostie 1999] supports event logging and screen capturing via Java and does not require special equipment. Both KALDI and UsAGE also support log file analysis (see Section 4.2).

2 This method is not to be confused with the UsAGE analytical modeling approach discussed in Section 7.

Remote testing methods enable testing between a tester and participant who are not colocated. In this case the evaluator is not able to observe the participant directly, but can gather data about the process over a computer network. Remote testing methods are distinguished according to whether a tester observes the participant during testing. Same-time different-place and different-time different-place are two major remote testing approaches [Hartson et al. 1996].

In same-time different-place or remote-control testing the tester observes the participant's screen through network transmissions (e.g., using PC Anywhere or Timbuktu) and may be able to hear what the participant says via a speaker telephone or a microphone affixed to the computer. Software makes it possible for the tester to interact with the participant during the test, which is essential for techniques such as the question-asking or thinking-aloud protocols that require such interaction.

The tester does not observe the participant during different-time different-place testing. An example of this approach is the journaled session [Nielsen 1993], in which software guides the participant through a testing session and logs the results. Evaluators can use this approach with prototypes to get feedback early in the design process, as well as with released products. In the early stages, evaluators distribute disks containing a prototype of a software product and embedded code for recording users' actions. Users experiment with the prototype and return the disks to evaluators upon completion. It is also possible to embed dialog boxes within the prototype in order to record user comments or observations during usage. For released products, evaluators use this method to capture statistics about the frequency with which the user has used a feature or the occurrence of events of interest (e.g., error messages). This information is valuable for optimizing frequently used features and the overall usability of future releases.

Remote testing approaches allow for wider testing than traditional methods, but evaluators may experience technical difficulties with hardware and/or software components (e.g., inability to correctly configure monitoring software or network failures). This can be especially problematic for same-time different-place testing where the tester needs to observe the participant during testing. Most techniques also have restrictions on the types of UIs to which they can be applied. This is mainly determined by the underlying hardware (e.g., PC Anywhere only operates on PC platforms) [Hartson et al. 1996]. KALDI, mentioned above, also supports remote testing. Since it was developed in Java, evaluators can use it for same- and different-time testing of Java applications on a wide range of computing platforms.

4.1.2. Automating Usability Testing Methods: Capture Support—Web UIs. The Web enables remote testing and performance measurement on a much larger scale than is feasible with WIMP interfaces. Both same-time different-place and different-time different-place approaches can be employed for remote testing of Web UIs. Similar to journaled sessions, Web servers maintain usage logs and automatically generate a log file entry for each request.

These entries include the IP address of the requester, request time, name of the requested Web page, and in some cases the URL of the referring page (i.e., from where the user came). Server logs cannot record user interactions that occur only on the client side (e.g., use of within-page anchor links or the back button), and the validity of server log data is questionable due to caching by proxy servers and browsers [Etgen and Cantor 1999; Scholtz and Laskowski 1998]. Server logs may not reflect usability, especially since these logs are often difficult to interpret [Schwartz 2000] and users' tasks may not be discernible [Byrne et al. 1999; Schwartz 2000].

Client-side logs capture more accurate, comprehensive usage data than server-side logs because they allow all browser events to be recorded. Such logging may provide more insight about usability. On the downside, it requires every Web page to be modified to log usage data, or else use of an instrumented browser or special proxy server.

The NIST WebMetrics tool suite [Scholtz and Laskowski 1998] captures client-side usage data. This suite includes WebVIP (web visual instrumentor program), a visual tool that enables the evaluator to add event handling code to Web pages. This code automatically records the page identifier and a timestamp in an ASCII file every time a user selects a link. (This package also includes a visualization tool, VISVIP [Cugini and Scholtz 1999], for viewing logs collected with WebVIP; see Section 4.2.) Using these client-side data, the evaluator can accurately measure time spent on tasks or particular pages as well as study use of the back button and user clickstreams. Despite its advantages over server-side logging, WebVIP requires the evaluator to make a copy of an entire Web site, which could lead to invalid path specifications and other difficulties with getting the copied site to function properly. The evaluator must also add logging code to each individual link on a page. Since WebVIP only collects data on selected HTML links, it does not record interactions with other Web objects, such as forms. It also does not record usage of external or noninstrumented links.

Similar to WebVIP, the Web event-logging tool (WET) [Etgen and Cantor 1999] supports the capture of client-side data, including clicks on Web objects, window resizing, typing in a form object, and form resetting. WET interacts with Microsoft Internet Explorer and Netscape Navigator to record browser event information, including the type of event, a timestamp, and the document-window location. This gives the evaluator a more complete view of the user's interaction with a Web interface than WebVIP. WET does not require as much effort to employ as WebVIP, nor does it suffer from the same limitations. To use this tool, the evaluator specifies events (e.g., clicks, changes, loads, and mouseovers) and event-handling functions in a text file on the Web server; sample files are available to simplify this step. The evaluator must also add a single call to the text file within the <head> tag of each Web page to be logged. Currently, the log file analysis for both WebVIP and WET is manual. Future work has been proposed to automate this analysis.
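
To give a sense of what an evaluator does with such client-side data, the following sketch post-processes a simplified event log to obtain per-page dwell times and a rough back-button count. The record format (a page identifier plus a timestamp per event) and the back-navigation heuristic are our assumptions; neither WebVIP's ASCII files nor WET's event records necessarily look like this.

    from collections import Counter

    def analyze_client_log(events):
        """events: chronological list of (page_id, timestamp_seconds) pairs,
        as might be extracted from a client-side capture log."""
        time_on_page = Counter()
        back_moves = 0
        # Dwell time on a page: time until the next recorded event.
        for (page, t), (_, next_t) in zip(events, events[1:]):
            time_on_page[page] += next_t - t
        # Crude back-button heuristic: a return to the page visited two steps earlier.
        for i in range(2, len(events)):
            if events[i][0] == events[i - 2][0] and events[i][0] != events[i - 1][0]:
                back_moves += 1
        return time_on_page, back_moves

    log = [("home", 0.0), ("products", 12.5), ("widget", 30.0), ("products", 55.0)]
    dwell, backs = analyze_client_log(log)
    print(dict(dwell), backs)
    # {'home': 12.5, 'products': 17.5, 'widget': 25.0} 1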

The NIST WebMetrics tool suite also includes WebCAT (category analysis tool), a tool that aids in Web site category analysis by a technique sometimes known as card sorting [Nielsen 1993]. In nonautomated card sorting, the evaluator (or a team of evaluators) writes concepts on pieces of paper, and users group the topics into piles. The evaluator manually analyzes these groupings to determine a good category structure. WebCAT allows the evaluator to test proposed topic categories for a site via a category matching task (a variation of card sorting where users assign concepts to predefined categories); this task can be completed remotely by users. Results are compared to the designer's category structure, and the evaluator can use the analysis to inform the best information organization for a site. WebCAT enables wider testing and faster analysis than traditional card sorting, and helps make the technique scale for a large number of topic categories.
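
The comparison step that WebCAT automates can be illustrated with a small sketch: given the designer's intended category for each topic and the category each participant assigned, compute per-topic agreement. The data shapes and topic names below are hypothetical; WebCAT's own file formats and reports are not reproduced here.

    def category_agreement(designer, users):
        """designer: dict mapping each topic to its intended category.
        users: list of dicts mapping topics to the category each participant chose.
        Returns the fraction of user assignments that match the designer's, per topic."""
        agreement = {}
        for topic, intended in designer.items():
            votes = [u.get(topic) for u in users if topic in u]
            agreement[topic] = (sum(v == intended for v in votes) / len(votes)) if votes else 0.0
        return agreement

    designer = {"refund policy": "Support", "careers": "About Us"}
    users = [{"refund policy": "Support", "careers": "Support"},
             {"refund policy": "Support", "careers": "About Us"}]
    print(category_agreement(designer, users))
    # {'refund policy': 1.0, 'careers': 0.5}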

4.1.3. Automating Usability Testing Methods: Capture Support—Discussion. Automated capture methods represent important first steps toward informing UI improvements: they provide input data for analysis and, in the case of remote testing, enable the evaluator to collect data for a larger number of users than traditional methods. Without this automation, evaluators would have to manually record usage data, expend considerable time reviewing videotaped testing sessions or, in the case of the Web, rely on questionable server logs. Methods such as KALDI and WET capture high-level events that correspond to specific tasks or UI features. KALDI also supports automated analysis of captured data, discussed below.

Table III summarizes performance measurement and remote testing methods discussed in this section. It is difficult to assess the ease of use and learning of these approaches, especially IDCAT and remote testing approaches that require integration of hardware and software components, such as video recorders and logging software. For Web site logging, WET appears to be easier to use and learn than WebVIP. It requires the creation of an event handling file and the addition of a small block of code in each Web page header, whereas WebVIP requires the evaluator to add code to every link on all Web pages. WET also enables the evaluator to capture more comprehensive usage data than WebVIP. WebCAT appears straightforward to use and learn for topic category analysis. Both remote testing and performance measurement techniques have restrictions on the types of UIs to which they can be applied. This is mainly determined by the underlying hardware (e.g., PC Anywhere only operates on PC platforms) or UIMS, although KALDI can potentially be used to evaluate Java applications on a wide range of platforms.

4.2. Automating Usability Testing Methods: Analysis Support

Log file analysis methods automate analysis of data captured during formal or informal interface use. Since Web servers automatically log client requests, log file analysis is a heavily used methodology for evaluating Web interfaces [Drott 1998; Fuller and de Graaff 1996; Hochheiser and Shneiderman 2001; Sullivan 1997]. Our survey reveals four general approaches for analyzing WIMP and Web log files: metric-based, pattern-matching, task-based, and inferential. Table IV provides a synopsis of automated analysis methods discussed in the remainder of this section. We discuss support available for the four general approaches separately.

Table IV. Synopsis of Automated Analysis Support for Usability Testing Methods (a)

Method Class: Usability Testing
Automation Type: Analysis
Method Type: Log File Analysis—analyze usage data (19 methods)
  Use metrics during log file analysis (DRUM, MIKE UIMS, AMME); UI: WIMP; Effort: IF
  Use metrics during log file analysis (Service Metrics, Bacheldor [1999]); UI: Web; Effort: IF
  Use pattern matching during log file analysis (MRP); UI: WIMP; Effort: IF
  Use task models during log file analysis (IBOT, QUIP, KALDI, UsAGE); UI: WIMP; Effort: IF
  Use task models and pattern-matching during log file analysis (EMA, USINE, RemUSINE); UI: WIMP; Effort: IFM
  Visualization of log files (Guzdial et al. [1994]); UI: WIMP; Effort: IF
  Statistical analysis or visualization of log files (traffic- and time-based analyses, VISVIP, Starfield, and Dome Tree visualizations); UI: Web; Effort: IF

(a) The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M).


4.2.1. Automating Usability Testing Methods: Analysis Support—Metric-Based Analysis of Log Files. Metric-based approaches generate quantitative performance measurements. Three examples for WIMP interfaces are DRUM [Macleod and Rengger 1993], the MIKE UIMS [Olsen, Jr. and Halversen 1988], and AMME (automatic mental model evaluator) [Rauterberg 1995, 1995b; Rauterberg and Aeppili 1995]. DRUM enables the evaluator to review a videotape of a usability test and manually log starting and ending points for tasks. DRUM processes this log and derives several measurements, including: task completion time, user efficiency (i.e., effectiveness divided by task completion time), and productive period (i.e., portion of time the user did not have problems). DRUM also synchronizes the occurrence of events in the log with videotaped footage, thus speeding up video analysis.
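
The arithmetic behind these DRUM-style measurements is straightforward once the evaluator has coded task boundaries, an effectiveness score, and problem episodes. The sketch below assumes such evaluator-coded inputs; the field names and the 0-to-1 effectiveness scale are our assumptions, not DRUM's format.

    from dataclasses import dataclass

    @dataclass
    class TaskRecord:
        start: float          # seconds from session start
        end: float
        effectiveness: float  # evaluator-assigned score for task output (assumed 0..1)
        problem_time: float   # seconds spent in evaluator-coded problem episodes

    def task_metrics(rec: TaskRecord) -> dict:
        task_time = rec.end - rec.start
        return {
            "task_time": task_time,
            # user efficiency: effectiveness divided by task completion time
            "user_efficiency": rec.effectiveness / task_time if task_time else 0.0,
            # productive period: portion of time the user did not have problems
            "productive_period": (task_time - rec.problem_time) / task_time if task_time else 0.0,
        }

    print(task_metrics(TaskRecord(start=10.0, end=130.0, effectiveness=0.9, problem_time=24.0)))
    # {'task_time': 120.0, 'user_efficiency': 0.0075, 'productive_period': 0.8}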

The MIKE UIMS enables an evaluator to assess the usability of a UI specified as a model that can be rapidly changed and compiled into a functional UI. MIKE captures usage data and generates a number of general, physical, logical, and visual metrics, including performance time, command frequency, the number of physical operations required to complete a task, and required changes in the user's focus of attention on the screen. MIKE also calculates these metrics separately for command selection (e.g., traversing a menu, typing a command name, or hitting a button) and command specification (e.g., entering arguments for a command) to help the evaluator locate specific problems within the UI.

AMME employs Petri nets [Petri 1973] to reconstruct and analyze the user's problem-solving process. It requires a specially formatted log file and a manually created system description file (i.e., a list of interface states and a state transition matrix) in order to generate the Petri net. It then computes measures of behavioral complexity (i.e., steps taken to perform tasks), routinization (i.e., repetitive use of task sequences), and ratios of thinking versus waiting time. User studies with novices and experts validated these quantitative measures and showed behavioral complexity to correlate negatively with learning (i.e., fewer steps are taken to solve tasks as a user learns the interface) [Rauterberg and Aeppili 1995]. Hence, the behavioral complexity measure provides insight on interface complexity. It is also possible to simulate the generated Petri net (see Section 8) to further analyze the user's problem-solving and learning processes. Multidimensional scaling and Markov analysis tools are available for comparing multiple Petri nets (e.g., nets generated from novice and expert user logs). Since AMME processes log files, it could easily be extended to Web interfaces.

For the Web, site analysis tools developed by Service Metrics [1999] and others [Bacheldor 1999] allow evaluators to pinpoint performance bottlenecks, such as slow server response time, that may have a negative impact on the usability of a Web site. Service Metrics' tools, for example, can collect performance measures from multiple geographical locations under various access conditions. In general, performance measurement approaches focus on server and network performance, but provide little insight into the usability of the Web site itself.

4.2.2. Automating Usability Testing Methods: Analysis Support—Pattern-Matching Analysis of Log Files. Pattern-matching approaches, such as MRP (maximum repeating pattern) [Siochi and Hix 1991], analyze user behavior captured in logs. MRP detects and reports repeated user actions (e.g., consecutive invocations of the same command and errors) that may indicate usability problems. Studies with MRP showed the technique to be useful for detecting problems with expert users, but additional data prefiltering was required for detecting problems with novice users. Whether the evaluator performed this prefiltering or it was automated is unclear in the literature.
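
The idea behind repeated-pattern detection can be illustrated with a simplified scan for action subsequences that immediately repeat, as sketched below. This is not Siochi and Hix's MRP algorithm, just an approximation of the kind of pattern it flags.

    from collections import Counter

    def repeating_patterns(actions, max_len=3, min_repeats=2):
        """Flag subsequences that immediately repeat at least min_repeats times in a row.
        A simplified illustration of repeated-pattern detection, not MRP itself."""
        flagged = Counter()
        for n in range(1, max_len + 1):
            i = 0
            while i + n <= len(actions):
                pattern = tuple(actions[i:i + n])
                repeats = 1
                while actions[i + repeats * n:i + (repeats + 1) * n] == list(pattern):
                    repeats += 1
                if repeats >= min_repeats:
                    flagged[pattern] = max(flagged[pattern], repeats)
                    i += repeats * n
                else:
                    i += 1
        return flagged

    log = ["open", "undo", "undo", "undo", "copy", "paste", "copy", "paste", "save"]
    print(repeating_patterns(log))
    # Counter({('undo',): 3, ('copy', 'paste'): 2})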

Three evaluation methods employ pattern matching in conjunction with task models. We discuss these methods immediately below.

4.2.3. Automating Usability Testing Methods: Analysis Support—Task-Based Analysis of Log Files. Task-based approaches analyze discrepancies between the designer's anticipation of the user's task model and what a user actually does while using the system. The IBOT system [Zettlemoyer et al. 1999] automatically analyzes log files to detect task completion events. The IBOT system interacts with Windows operating systems to capture low-level window events (e.g., keyboard and mouse actions) and screen buffer information (i.e., a screen image that can be processed to automatically identify widgets). The system then combines this information into interface abstractions (e.g., menu select and menubar search operations). Evaluators can use the system to compare user and designer behavior on these tasks and to recognize patterns of inefficient or incorrect behaviors during task completion. Without such a tool, the evaluator has to study the log files and do the comparison manually. Future work has been proposed to provide critique support.

The QUIP (quantitative user interface profiling) tool [Helfrich and Landay 1999] and KALDI [Al-Qaimari and McRostie 1999] (see previous section) provide more advanced approaches to task-based log file analysis for Java-based UIs. Unlike other approaches, QUIP aggregates traces of multiple user interactions and compares the task flows of these users to the designer's task flow. QUIP encodes quantitative time- and trace-based information into directed graphs (see Figure 2). For example, the average time between actions is indicated by the color of each arrow, and the proportion of users who performed a particular sequence of actions is indicated by the width of each arrow. The designer's task flow is indicated by the diagonal shading in Figure 2. Currently, the evaluator must program the UI to collect the necessary usage data, and must manually analyze the graphs to identify usability problems.
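
The aggregation step that QUIP performs can be approximated as follows: collapse several users' timestamped action traces into a directed graph whose edges carry the fraction of users who traversed them and the mean time between the two actions (the quantities QUIP renders as arrow width and color). The trace format is an assumption; QUIP's instrumentation and graph rendering are not shown.

    from collections import defaultdict

    def aggregate_traces(traces):
        """traces: one list per user of (action, timestamp_seconds) pairs.
        Returns, for each action->action edge, the fraction of users who traversed it
        and the mean time between the two actions."""
        users = defaultdict(set)       # edge -> set of user ids that traversed it
        durations = defaultdict(list)  # edge -> observed times between the two actions
        for uid, trace in enumerate(traces):
            for (a, ta), (b, tb) in zip(trace, trace[1:]):
                users[(a, b)].add(uid)
                durations[(a, b)].append(tb - ta)
        n = len(traces)
        return {edge: {"user_fraction": len(users[edge]) / n,
                       "mean_seconds": sum(durations[edge]) / len(durations[edge])}
                for edge in users}

    user_a = [("start", 0), ("search", 9), ("results", 20), ("done", 31)]
    user_b = [("start", 0), ("browse", 4), ("search", 19), ("results", 27), ("done", 40)]
    for edge, stats in aggregate_traces([user_a, user_b]).items():
        print(edge, stats)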

KALDI captures usage data and screen shots for Java applications. It also enables the evaluator to classify tasks (both manually and via automatic filters), compare user performance on tasks, and play back synchronized screen shots. It depicts logs graphically in order to facilitate analysis.

UsAGE [Uehling and Wolf 1995], which also supports logging usage data within a UIMS, provides a similar graphical presentation for comparing event logs for expert and novice users. However, graph nodes are labeled with UIMS event names, thus making it difficult to map events to specific interface tasks. To mitigate this shortcoming, UsAGE allows the evaluator to replay recorded events in the interface.

Fig. 2. QUIP usage profile contrasting task flows for two users to the designer's task flow (diagonal shading) [Helfrich and Landay 1999]. Each node represents a user action, and arrows indicate actions taken by users. The width of arrows denotes the fraction of users completing actions, and the color of arrows reflects the average time between actions (darker colors correspond to longer times). Reprinted with permission of the authors.

Several systems incorporate pattern matching (see discussion above) into their analyses. This combination results in the most advanced log file analysis of all of the approaches surveyed. These systems include EMA (automatic analysis mechanism for the ergonomic evaluation of user interfaces) [Balbo 1996], USINE (user interface evaluator) [Lecerof and Paterno 1998], and RemUSINE (remote user interface evaluator) [Paterno and Ballardin 1999], all discussed below.

EMA uses a manually created dataflow task model and standard behavior heuristics to flag usage patterns that may indicate usability problems. EMA extends the MRP approach (repeated command execution) to detect additional patterns, including immediate task cancellation, shifts in direction during task completion, and discrepancies between task completion and the task model. EMA outputs results in an annotated log file, which the evaluator must manually inspect to identify usability problems. Application of this technique to the evaluation of ATM (automated teller machine) usage corresponded with problems identified using standard heuristic evaluation [Balbo 1996].

USINE [Lecerof and Paterno 1998] employs the ConcurTaskTrees [Paterno et al. 1997] notation to express temporal relationships among UI tasks (e.g., enabling, disabling, and synchronization). Using this information, USINE looks for precondition errors (i.e., task sequences that violate temporal relationships) and also reports quantitative metrics (e.g., task completion time) and information about task patterns, missing tasks, and user preferences reflected in the usage data. Studies with a graphical interface showed that USINE's results correspond with empirical observations and highlight the source of some usability problems. To use the system, evaluators must create task models using the ConcurTaskTrees editor as well as a table specifying mappings between log entries and the task model. USINE processes log files and outputs detailed reports and graphs to highlight usability problems. RemUSINE [Paterno and Ballardin 1999] is an extension that analyzes multiple log files (typically captured remotely) to enable comparison across users.
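
A much-simplified version of this kind of precondition check is sketched below, under the assumption that the task model has already been reduced to a set of "these tasks must occur before that one" constraints. ConcurTaskTrees expresses richer temporal operators (disabling, synchronization, and so on) than this, and USINE's own mapping tables are not reproduced here.

    def precondition_errors(log, enables):
        """log: ordered list of task names performed by the user.
        enables: dict mapping a task to the set of tasks that must occur before it.
        Returns (position, task, missing prerequisites) for each violation."""
        errors = []
        done = set()
        for pos, task in enumerate(log):
            missing = enables.get(task, set()) - done
            if missing:
                errors.append((pos, task, missing))
            done.add(task)
        return errors

    enables = {"submit form": {"fill name", "fill email"}, "print receipt": {"submit form"}}
    log = ["fill name", "submit form", "fill email", "print receipt"]
    print(precondition_errors(log, enables))
    # [(1, 'submit form', {'fill email'})]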

4.2.4. Automating Usability Testing Methods: Analysis Support—Inferential Analysis of Log Files. Inferential analysis of Web log files includes both statistical and visualization techniques. Statistical approaches include traffic-based analysis (e.g., pages per visitor or visitors per page) and time-based analysis (e.g., clickstreams and page-view durations) [Drott 1998; Fuller and de Graaff 1996; Sullivan 1997; Theng and Marsden 1998]. Some methods require manual preprocessing or filtering of the logs before analysis. Furthermore, the evaluator must interpret reported measures in order to identify usability problems. Software tools, such as WebTrends [WebTrends Corporation 2000], facilitate analysis by presenting results in graphical and report formats.
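
As a concrete illustration of traffic-based analysis, the sketch below parses common-log-format-style server entries and computes pages per visitor and visitors per page, using the IP address as a rough stand-in for a visitor. The log format and sample entries are assumptions; commercial tools such as WebTrends handle many formats and far more report types.

    import re
    from collections import defaultdict

    # Assumes NCSA/Apache "common log format"-style entries; field layout varies across servers.
    LOG_LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) [^"]*"')

    def traffic_summary(lines):
        """Pages per visitor and visitors per page, keyed on IP address as a rough
        stand-in for 'visitor' (a known weakness of server-side logs)."""
        pages_by_visitor = defaultdict(set)
        visitors_by_page = defaultdict(set)
        for line in lines:
            m = LOG_LINE.match(line)
            if not m or m.group("method") != "GET":
                continue
            pages_by_visitor[m.group("ip")].add(m.group("url"))
            visitors_by_page[m.group("url")].add(m.group("ip"))
        return ({ip: len(p) for ip, p in pages_by_visitor.items()},
                {url: len(v) for url, v in visitors_by_page.items()})

    sample = ['10.0.0.1 - - [05/Dec/2001:10:01:00 -0800] "GET /index.html HTTP/1.0" 200 1043',
              '10.0.0.1 - - [05/Dec/2001:10:01:30 -0800] "GET /papers.html HTTP/1.0" 200 877',
              '10.0.0.2 - - [05/Dec/2001:10:02:00 -0800] "GET /index.html HTTP/1.0" 200 1043']
    print(traffic_summary(sample))
    # ({'10.0.0.1': 2, '10.0.0.2': 1}, {'/index.html': 2, '/papers.html': 1})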

Statistical analysis is largely inconclu-sive for Web server logs, since they provideonly a partial trace of user behavior andtiming estimates may be skewed bynetwork latencies. Server log files are alsomissing valuable information about whattasks users want to accomplish [Byrneet al. 1999; Schwartz 2000]. Nonethe-less, statistical analysis techniques havebeen useful for improving usability andenable ongoing, cost-effective evaluationthroughout the life of a site [Fuller andde Graaff 1996; Sullivan 1997].

Visualization is also used for inferential analysis of WIMP and Web log files [Chi et al. 2000; Cugini and Scholtz 1999; Guzdial et al. 1994; Hochheiser and Shneiderman 2001]. It enables evaluators to filter, manipulate, and render log file data in a way that ideally facilitates analysis. Guzdial et al. [1994] propose several techniques for analyzing WIMP log files, such as color-coding patterns and command usage, tracking screen updates, and tracking mouse click locations and depth (i.e., number of times the user clicked the mouse in screen areas). However, there is no discussion of how effective these approaches are in supporting analysis.

Starfield visualization [Hochheiser and Shneiderman 2001] is one approach that enables evaluators to interactively explore Web server log data in order to gain an understanding of human factors issues related to visitation patterns. This approach combines the simultaneous display of a large number of individual data points (e.g., URLs requested vs. time of requests) in an interface that supports zooming, filtering, and dynamic querying [Ahlberg and Shneiderman 1994]. Visualizations provide a high-level view of usage patterns (e.g., usage frequency, correlated references, bandwidth usage, HTTP errors, and patterns of repeated visits over time) that the evaluator must explore to identify usability problems. It would be beneficial to employ a statistical analysis approach to study traffic, clickstreams, and page views prior to exploring visualizations.

The Dome Tree visualization [Chi et al. 2000] provides an insightful representation of simulated (see Section 8) and actual Web usage captured in server log files. This approach maps a Web site into a three-dimensional surface representing the hyperlinks (see the top part of Figure 3). The location of links on the surface is determined by a combination of content similarity, link usage, and link structure of Web pages. The visualization highlights the most commonly traversed subpaths. An evaluator can explore these usage paths to possibly gain insight about the information "scent" (i.e., common topics among Web pages on the path) as depicted in the bottom window of Figure 3. This additional information may help the evaluator infer what the information needs of site users are, and more importantly, may help infer whether the site satisfies those needs. The Dome Tree visualization also reports a crude path traversal time based on the sizes of pages (i.e., number of bytes in HTML and image files) along the path. Server log accuracy limits the extent to which this approach can successfully indicate usability problems. As is the case for Starfield visualization, it would be beneficial to statistically analyze log files prior to using this approach.

VISVIP [Cugini and Scholtz 1999] is a three-dimensional tool for visualizing log files compiled by WebVIP during usability testing (see previous section). Figure 4 shows VISVIP's Web site (top graph) and usage path (bottom graph) depictions to be similar to the Dome Tree visualization approach. VISVIP generates a 2-D layout of the site where adjacent nodes are placed closer together than nonadjacent nodes. A third dimension reflects timing data as a dotted vertical bar at each node; the height is proportional to the amount of time. VISVIP also provides animation facilities for visualizing path traversal. Since WebVIP logs reflect actual task completion, prior statistical analysis is not necessary for VISVIP usage.

4.2.5. Automating Usability Testing Methods: Analysis Support—Discussion. Table IV summarizes log file analysis methods discussed in this section. Although the techniques vary widely on the four assessment criteria (effectiveness, ease of


Fig. 3. Dome Tree visualization [Chi et al. 2000] of a Web site with a usage path displayed as a series of connected lines across the left side. The bottom part of the figure displays information about the usage path, including an estimated navigation time and information scent (i.e., common keywords along the path). Reprinted with permission of the authors.

use, ease of learning, and applicability), all approaches offer substantial benefits over the alternative—time-consuming, unaided analysis of potentially large amounts of raw data. Hybrid task-based pattern-matching techniques like USINE may be the most effective (i.e., provide clear insight for improving usability via task analysis); however, they require additional effort and learning time over


Fig. 4. VISVIP visualization [Cugini and Scholtz 1999] of a Web site (top part). The bottom part of the figure displays a usage path (series of directed lines on the left side) laid over a site. The top and bottom figures are for two different sites. Reprinted with permission of the authors.


Table V. Synopsis of Automated Capture Support for Inspection Methods^a

Method Class: Inspection
Automation Type: Capture
Method Type: Cognitive Walkthrough—expert simulates user's problem solving (1 method)

UE Method                                                               UI     Effort
Software assists the expert with documenting a cognitive walkthrough   WIMP   F

^a The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M).

simpler pattern-matching approaches; this additional effort is mainly in the development of task models. Although pattern-matching approaches are easier to use and learn, they only detect problems for prespecified usage patterns.

Metric-based approaches in the WIMP domain have been effective at associating measurements with specific interface aspects, such as commands and tasks, which can then be used to identify usability problems. AMME also helps the evaluator to understand the user's problem-solving process and conduct simulation studies. However, metric-based approaches require the evaluator to conduct more analysis to ascertain the source of usability problems than task-based approaches. Metric-based techniques in the Web domain focus on server and network performance, which provides little usability insight. Similarly, inferential analysis of Web server logs is limited by their accuracy and may provide inconclusive usability information.

Most of the techniques surveyed in this section could be applied to WIMP and Web UIs other than the ones they were demonstrated on, with the exception of the MIKE UIMS and UsAGE, which require a WIMP UI to be developed within a special environment. AMME could be employed for both Web and WIMP UIs, provided log files and system models are available.

5. AUTOMATING INSPECTION METHODS

A usability inspection is an evaluation methodology whereby an evaluator examines the usability aspects of a UI design with respect to its conformance to a set of guidelines. Guidelines can range from highly specific prescriptions to broad principles. Unlike other UE methods, inspections rely solely on the evaluator's judgment. A large number of detailed usability guidelines have been developed for WIMP interfaces [Open Software Foundation 1991; Smith and Mosier 1986] and Web interfaces [Comber 1995; Detweiler and Omanson 1996; Levine 1996; Lynch and Horton 1999; Web Accessibility Initiative 1999]. Common nonautomated inspection techniques are heuristic evaluation [Nielsen 1993] and cognitive walkthroughs [Lewis et al. 1990].

Designers have historically experienced difficulties following design guidelines [Borges et al. 1996; de Souza and Bevan 1990; Lowgren and Nordqvist 1992; Smith 1986]. One study has demonstrated that designers are biased towards aesthetically pleasing interfaces, regardless of efficiency [Sears 1995]. Because designers have difficulty applying design guidelines, automation has been predominately used within the inspection class to objectively check guideline conformance. Software tools assist evaluators with guideline review by automatically detecting and reporting usability violations and in some cases making suggestions for fixing them [Balbo 1995; Farenc and Palanque 1999]. Automated capture, analysis, and critique support is available for the guideline review and cognitive walkthrough method types, as described in the remainder of this section.

5.1. Automating Inspection Methods: Capture Support

Table V summarizes capture support for inspection methods, namely, a system


Table VI. Synopsis of Automated Analysis Support for Inspection Methods^a

Method Class: Inspection
Automation Type: Analysis
Method Type: Guideline Review—expert checks guideline conformance (8 methods)

UE Method                                                                     UI     Effort
Use quantitative screen measures for analysis (AIDE, Parush et al. [1998])    WIMP
Analyze terminology and consistency of UI elements (Sherlock)                 WIMP
Analyze the structure of Web pages (Rating Game, HyperAT, Gentler)            Web
Use guidelines for analysis (WebSAT)                                          Web
Analyze the scanning path of a Web page (Design Advisor)                      Web

^a The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M).

developed to assist an evaluator with a cognitive walkthrough. During a cognitive walkthrough, an evaluator attempts to simulate a user's problem-solving process while examining UI tasks. At each step of a task, the evaluator assesses whether a user would succeed or fail to complete the step. Hence, the evaluator produces extensive documentation during this analysis. There was an early attempt to "automate" cognitive walkthroughs by prompting evaluators with walkthrough questions and enabling evaluators to record their analyses in HyperCard. Unfortunately, evaluators found this approach too cumbersome and time consuming to employ [Rieman et al. 1991].

5.2. Automating Inspection Methods: Analysis Support

Table VI provides a synopsis of automated analysis methods for inspection-based usability evaluation, discussed in detail in the remainder of this section. All of the methods require minimal effort to employ; we denote this with a blank entry in the effort column. We discuss support available for WIMP and Web UIs separately.

5.2.1. Automating Inspection Methods: Analysis Support—WIMP UIs. Several quantitative measures have been proposed for evaluating interfaces. Tullis [1983] derived six measures (overall density, local density, number of groups, size of groups, number of items, and layout complexity). Streveler and Wasserman [1984] proposed "boxing," "hot-spot," and "alignment" analysis techniques. These early techniques were designed for alphanumeric displays, whereas more recent techniques evaluate WIMP interfaces. Vanderdonckt and Gillo [1994] proposed five visual techniques (physical, composition, association and dissociation, ordering, and photographic techniques), which identified more visual design properties than traditional balance, symmetry, and alignment measures. Rauterberg [1996a] proposed and validated four measures (functional feedback, interactive directness, application flexibility, and dialog flexibility) to evaluate WIMP UIs. Quantitative measures have been incorporated into automated layout tools [Bodart et al. 1994; Kim and Foley 1993] as well as several automated analysis tools [Mahajan and Shneiderman 1997; Parush et al. 1998; Sears 1995], discussed immediately below.

Parush et al. [1998] developed and validated a tool for computing the complexity of dialog boxes implemented with Microsoft Visual Basic. It considers changes in the size of screen elements, the alignment and grouping of elements, as well as the utilization of screen space in its calculations. User studies demonstrated that tool results can be used to decrease screen search time and ultimately to improve screen layout. AIDE (semiautomated interface designer and evaluator) [Sears 1995] is a more advanced tool that helps designers assess and compare different design options using quantitative task-sensitive and task-independent metrics, including efficiency (i.e., distance of


cursor movement), vertical and horizontal alignment of elements, horizontal and vertical balance, and designer-specified constraints (e.g., position of elements). AIDE also employs an optimization algorithm to automatically generate initial UI layouts. Studies with AIDE showed it to provide valuable support for analyzing the efficiency of a UI and incorporating task information into designs.
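The kinds of quantitative layout measures such tools report can be sketched as follows. The widget geometry and the formulas below are illustrative only and are not AIDE's or Parush et al.'s actual definitions.

```python
# Illustrative layout measures over hypothetical widget bounding boxes
# (x, y, width, height); the formulas are simple stand-ins for the kinds
# of alignment, balance, and efficiency metrics described in the text.
from math import hypot

widgets = {"name_field": (10, 10, 120, 20),
           "addr_field": (10, 40, 120, 20),
           "ok_button":  (10, 80, 60, 25)}

def alignment_count(ws):
    """Number of distinct left edges; fewer distinct edges means tighter alignment."""
    return len({x for x, _, _, _ in ws.values()})

def horizontal_balance(ws, screen_w=200):
    """|left-half area - right-half area| / total area (0 = perfectly balanced)."""
    left = right = 0
    for x, y, w, h in ws.values():
        if x + w / 2 < screen_w / 2:
            left += w * h
        else:
            right += w * h
    return abs(left - right) / (left + right)

def cursor_travel(ws, task_sequence):
    """Efficiency proxy: distance between centers of widgets used in a task."""
    centers = {k: (x + w / 2, y + h / 2) for k, (x, y, w, h) in ws.items()}
    return sum(hypot(centers[b][0] - centers[a][0], centers[b][1] - centers[a][1])
               for a, b in zip(task_sequence, task_sequence[1:]))

print(alignment_count(widgets))
print(round(horizontal_balance(widgets), 2))
print(round(cursor_travel(widgets, ["name_field", "addr_field", "ok_button"]), 1))
```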

Sherlock [Mahajan and Shneiderman 1997] is another automated analysis tool for Windows interfaces. Rather than assessing ergonomic factors, it focuses on task-independent consistency checking (e.g., same widget placement and labels) within the UI or across multiple UIs; user studies have shown a 10 to 25% speedup for consistent interfaces [Mahajan and Shneiderman 1997]. Sherlock evaluates visual properties of dialog boxes, terminology (e.g., identify confusing terms and check spelling), as well as button sizes and labels. Sherlock evaluates any Windows UI that has been translated into a special canonical format file; this file contains GUI object descriptions. Currently, there are translators for Microsoft Visual Basic and Microsoft Visual C++ resource files.
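A simplified version of this kind of task-independent consistency check is sketched below; the dialog descriptions are hypothetical, and the check covers only label capitalization and button size, a small subset of what Sherlock evaluates.

```python
# A small sketch of consistency checking in the spirit of Sherlock: flag
# buttons whose label spelling/case or size differs across dialogs.
from collections import defaultdict

dialogs = {
    "OpenFile": [("button", "OK",     (60, 24)), ("button", "Cancel", (60, 24))],
    "SaveFile": [("button", "Ok",     (60, 24)), ("button", "Cancel", (72, 24))],
}

seen = defaultdict(list)   # canonical label -> [(dialog, exact_label, size)]
for dialog, widgets in dialogs.items():
    for kind, label, size in widgets:
        if kind == "button":
            seen[label.lower()].append((dialog, label, size))

for canon, uses in seen.items():
    labels = {label for _, label, _ in uses}
    sizes = {size for _, _, size in uses}
    if len(labels) > 1:
        print(f"inconsistent capitalization/spelling for '{canon}': {labels}")
    if len(sizes) > 1:
        print(f"inconsistent button sizes for '{canon}': {sizes}")
```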

5.2.2. Automating Inspection Methods: Analysis Support—Web UIs. The Rating Game [Stein 1997] is an automated analysis tool that attempts to measure the quality of a set of Web pages using a set of easily measurable features. These include an information feature (word to link ratio), a graphics feature (number of graphics on a page), a gadgets feature (number of applets, controls, and scripts on a page), and so on. The tool reports these raw measures without providing guidance for improving a Web page.
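The raw features the Rating Game reports are easy to approximate; the following sketch uses Python's standard HTML parser to count words, links, graphics, and gadget-like elements on a single page. The sample page and the feature set shown are illustrative only.

```python
# Raw page features in the spirit of the Rating Game, computed with the
# standard-library HTML parser; the page and feature names are illustrative.
from html.parser import HTMLParser

class PageFeatures(HTMLParser):
    def __init__(self):
        super().__init__()
        self.words = self.links = self.graphics = self.gadgets = 0
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1
        elif tag == "img":
            self.graphics += 1
        elif tag in ("applet", "object", "script"):
            self.gadgets += 1
    def handle_data(self, data):
        self.words += len(data.split())

page = "<html><body><p>Welcome to our catalog.</p><a href='/buy'>Buy</a><img src='x.gif'></body></html>"
features = PageFeatures()
features.feed(page)
print("word/link ratio:", features.words / max(features.links, 1))
print("graphics:", features.graphics, "gadgets:", features.gadgets)
```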

Two authoring tools from Middlesex University, HyperAT [Theng and Marsden 1998] and Gentler [Thimbleby 1997], perform a similar structural analysis at the site level. The goal of the Hypertext authoring tool (HyperAT) is to support the creation of well-structured hyperdocuments. It provides a structural analysis which focuses on verifying that the breadths and depths within a page and at the site level fall within thresholds (e.g., depth is less than three levels). (HyperAT also supports inferential analysis of server log files similar to other log file analysis techniques; see Section 4.2.) Gentler [Thimbleby 1997] provides similar structural analysis but focuses on maintenance of existing sites rather than the design of new ones.
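A minimal version of this breadth/depth check might look as follows; the site tree and the thresholds are hypothetical, and, as discussed later in this section, such thresholds have not been empirically validated.

```python
# A hypothetical site tree and illustrative (unvalidated) thresholds for the
# kind of breadth/depth structural check described above.
site = {"home": ["products", "about"],
        "products": [f"item{i}" for i in range(10)],
        "about": []}
site.update({f"item{i}": [] for i in range(10)})

MAX_DEPTH, MAX_BREADTH = 3, 8

def check(page, depth=0):
    if depth > MAX_DEPTH:
        print(f"{page}: depth {depth} exceeds {MAX_DEPTH}")
    if len(site[page]) > MAX_BREADTH:
        print(f"{page}: {len(site[page])} outgoing links exceed breadth threshold {MAX_BREADTH}")
    for child in site[page]:
        check(child, depth + 1)

check("home")
```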

The Web static analyzer tool (WebSAT) [Scholtz and Laskowski 1998], part of the NIST WebMetrics suite of tools, assesses static HTML according to a number of usability guidelines, such as whether all graphics contain ALT attributes, the average number of words in link text, and the existence of at least one outgoing link on a page. Currently, WebSAT only processes individual pages and does not suggest improvements [Chak 2000]. Future plans for this tool include adding the ability to inspect the entire site more holistically in order to identify potential problems in interactions between pages.
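A sketch of this style of static guideline checking appears below. It covers only three of the checks mentioned (ALT attributes, link-text length, and the presence of an outgoing link) and is not WebSAT's implementation; the sample markup is a placeholder.

```python
# A sketch of static guideline checks on a single HTML page: ALT text on
# images, average link-text length, and at least one outgoing link.
from html.parser import HTMLParser

class GuidelineChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images_without_alt = 0
        self.link_word_counts = []
        self.outgoing_links = 0
        self._in_link = False
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and not attrs.get("alt"):
            self.images_without_alt += 1
        if tag == "a" and attrs.get("href"):
            self._in_link = True
            self.outgoing_links += 1
            self.link_word_counts.append(0)
    def handle_data(self, data):
        if self._in_link and self.link_word_counts:
            self.link_word_counts[-1] += len(data.split())
    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

checker = GuidelineChecker()
checker.feed("<img src='logo.gif'><a href='/next'>next page</a>")
print("images missing ALT:", checker.images_without_alt)
print("avg words per link:", sum(checker.link_word_counts) / max(len(checker.link_word_counts), 1))
print("has outgoing link:", checker.outgoing_links > 0)
```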

Unlike other analysis approaches, the Design Advisor [Faraday 2000] enables visual analysis of Web pages. The tool uses empirical results from eye-tracking studies designed to assess the attentional effects of various elements, such as animation, images, and highlighting, in multimedia presentations [Faraday and Sutcliffe 1998]; these studies found motion, size, images, color, text style, and position to be scanned in this order. The Design Advisor determines and superimposes a scanning path on a Web page as depicted in Figure 5. It currently does not provide suggestions for improving scanning paths.

5.2.3. Automating Inspection Methods: Analysis Support—Discussion. Table VI summarizes automated analysis methods discussed in this section. All of the WIMP approaches are highly effective at checking for guidelines that can be operationalized. These include computing quantitative measures (e.g., the size of screen elements, screen space usage, and


Fig. 5. The Design Advisor [Faraday 2000] superimposes a scanning path on the Web page. The numbers indicate the order in which elements will be scanned. Reproduced with permission of the author.

efficiency) and checking consistency (e.g., same widget size and placement across screens). All of the tools have also been empirically validated. However, the tools cannot assess UI aspects that cannot be operationalized, such as whether the labels used on elements will be understood by users. For example, Farenc et al. [1999] show that only 78% of a set of established ergonomic guidelines could be operationalized in the best case scenario and only 44% in the worst case. All methods also suffer from limited applicability (interfaces developed with Microsoft Visual Basic or Microsoft Visual C). The tools appear to be straightforward to learn and use, provided the UI is developed in the appropriate environment.

The Rating Game, HyperAT, and Gentler compute and report a number of statistics about a page (e.g., number of links, graphics, and words). However, the effectiveness of these structural analyses is questionable, since the thresholds have not been empirically validated.

Although there have been some investigations into breadth and depth tradeoffs for the Web [Larson and Czerwinski 1998; Zaphiris and Mtei 1997], general thresholds still remain to be established. Although WebSAT helps designers adhere to good coding practices, these practices have not been shown to improve usability. There may be some indirect support for these methods through research aimed at identifying aspects that affect Web site credibility [Fogg 1999; Fogg et al. 2000], since credibility affects usability and vice versa. This work presents a survey of over 1,400 Web users as well as an empirical study which indicated that typographical errors, ads, broken links, and other aspects have an impact on credibility; some of these aspects can be detected by automated UE tools, such as WebSAT. All of these approaches are easy to use, learn, and apply to all Web UIs.

The visual analysis supported by the Design Advisor could help designers improve Web page scanning. It requires a


Table VII. Synopsis of Automated Critique Support for Inspection Methods^a

Method Class: Inspection
Automation Type: Critique
Method Type: Guideline Review—expert checks guideline conformance (11 methods)

UE Method                                                                      UI     Effort
Use guidelines for critiquing (KRI/AG, IDA, CHIMES, Ergoval)                   WIMP
Use guidelines for critiquing and modifying a UI (SYNOP)                       WIMP   M
Check HTML syntax (Weblint, Dr. Watson)                                        Web
Use guidelines for critiquing (Lift Online, Lift Onsite, Bobby, WebEval)       Web

^a The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M).

special Web browser for use, but is easy to use, learn, and apply to basic Web pages (i.e., pages that do not use scripts, applets, Macromedia Flash, or other non-HTML technology). Heuristics employed by this tool were developed based on empirical results from eye-tracking studies of multimedia presentations, but have not been empirically validated for Web pages.

We are currently developing a methodology for deriving Web design guidelines directly from sites that have been assessed by human judges [Ivory et al. 2000, 2001; Ivory 2001]. We have identified a number of Web page measures having to do with page composition (e.g., number of words, links, and images), page formatting (e.g., emphasized text, text positioning, and text clusters), and overall page characteristics (e.g., page size and download speed). Empirical studies have shown that we can predict with high accuracy whether a Web page will be rated favorably based on key metrics. Future work will identify profiles of highly rated pages that can be used to help designers improve Web UIs.

5.3. Automating Inspection Methods: Critique Support

Critique systems give designers clear directions for conforming to violated guidelines and consequently improving usability. As mentioned above, following guidelines is difficult, especially when there are a large number of guidelines to consider. Automated critique approaches, especially ones that modify a UI [Balbo 1995], provide the highest level of support for adhering to guidelines.

Table VII provides a synopsis of automated critique methods discussed in the remainder of this section. All but one method, SYNOP, require minimal effort to employ; we denote this with a blank entry in the effort column. We discuss support available for WIMP and Web UIs separately.

5.3.1. Automating Inspection Methods: Critique Support—WIMP UIs. The KRI/AG tool (knowledge-based review of user interface) [Lowgren and Nordqvist 1992] is an automated critique system that checks the guideline conformance of X Window UI designs created using the TeleUSE UIMS [Lee 1997]. KRI/AG contains a knowledge base of guidelines and style guides, including the Smith and Mosier guidelines [1986] and Motif style guides [Open Software Foundation 1991]. It uses this information to automatically critique a UI design and generate comments about possible flaws in the design. IDA (user interface design assistance) [Reiterer 1994] also embeds rule-based (i.e., expert system) guideline checks within a UIMS. SYNOP [Balbo 1995] is a similar automated critique system that performs a rule-based critique of a control system application. SYNOP also modifies the UI model based on its evaluation. CHIMES (computer-human interaction models) [Jiang et al. 1993] assesses the degree to which NASA's space-related critical and high-risk interfaces meet human factors standards.


Unlike KRI/AG, IDA, SYNOP, and CHIMES, Ergoval [Farenc and Palanque 1999] is widely applicable to WIMP UIs on Windows platforms. It organizes guidelines into an object-based framework (i.e., guidelines that are relevant to each graphical object) in order to bridge the gap between the developer's view of an interface and how guidelines are traditionally presented (i.e., checklists). This approach is being incorporated into a Petri net environment [Palanque et al. 1999] to enable guideline checks throughout the development process.

5.3.2. Automating Inspection Methods: Critique Support—Web UIs. Several automated critique tools use guidelines for Web site usability checks. The World Wide Web Consortium's HTML Validation Service [World Wide Web Consortium 2000] checks that HTML code conforms to standards. Weblint [Bowers 1996] and Dr. Watson [Addy & Associates 2000] also check HTML syntax and in addition verify links. Dr. Watson also computes download speed and spell-checks text.
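The link-verification step these tools perform can be sketched as follows. The example uses crude regular-expression extraction and a HEAD request per absolute URL; real checkers also resolve relative links and handle redirects, timeouts, and robots.txt, and the sample page content is a placeholder.

```python
# Crude link verification: extract absolute hrefs and issue a HEAD request
# for each, reporting links that cannot be retrieved.
import re
from urllib.request import Request, urlopen

def check_links(html):
    # simple href extraction; a real tool would use a proper HTML parser
    for url in re.findall(r'href="(http[^"]+)"', html):
        try:
            urlopen(Request(url, method="HEAD"), timeout=5)
            print("OK     ", url)
        except OSError as err:   # URLError/HTTPError are OSError subclasses
            print("BROKEN ", url, "-", err)

check_links('<a href="http://example.com/">home</a>')
```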

UsableNet's LIFT Online and LIFT Onsite [Usable Net 2000] perform usability checks similar to WebSAT (discussed in Section 5.2.2) as well as checking for use of standard and portable link, text, and background colors, the existence of stretched images, and other guideline violations. LIFT Online suggests improvements, and LIFT Onsite guides users through making suggested improvements. According to Chak [2000], these two tools provide valuable guidance for improving Web sites. Bobby [Clark and Dardailler 1999; Cooper 1999] is another HTML analysis tool that checks Web pages for their accessibility [Web Accessibility Initiative 1999] to people with disabilities.

Conforming to the guidelines embedded in these tools can potentially eliminate usability problems that arise due to poor HTML syntax (e.g., missing page elements) or guideline violations. As previously discussed, research on Web site credibility [Fogg 1999; Fogg et al. 2000] suggests that some of the aspects assessed by these tools, such as broken links and other errors, may also affect usability due to the relationship between usability and credibility. However, Ratner et al. [1996] question the validity of HTML usability guidelines, since most HTML guidelines have not been subjected to a rigorous development process, as established guidelines for WIMP interfaces have. Analysis of 21 HTML guidelines showed little consistency among them, with 75% of recommendations appearing in only one style guide. Furthermore, only 20% of HTML-relevant recommendations from established WIMP guidelines existed in the 21 HTML style guides. WebEval [Scapin et al. 2000] is one automated critique approach being developed to address this issue. Similar to Ergoval (discussed above), it provides a framework for applying established WIMP guidelines to relevant HTML components. Even with WebEval, some problems, such as whether text will be understood by users, are difficult to detect automatically.

5.3.3. Automating Inspection Methods: Critique Support—Discussion. Table VII summarizes automated critique methods discussed in this section. All of the WIMP approaches are highly effective at suggesting UI improvements for those guidelines that can be operationalized. These include checking for the existence of labels for text fields, listing menu options in alphabetical order, and setting default values for input fields. However, they cannot assess UI aspects that cannot be operationalized, such as whether the labels used on elements will be understood by users. As previously discussed, Farenc et al. [1999] show that only 78% of a set of established ergonomic guidelines could be operationalized in the best case scenario and only 44% in the worst case. Another drawback of approaches that are not embedded within a UIMS (e.g., SYNOP) is that they require considerable modeling and learning effort on the part of the evaluator. All methods, except Ergoval, also suffer from limited applicability.


Table VIII. Synopsis of Automated Capture Support for Inquiry Methods^a

Method Class: Inquiry
Automation Type: Capture
Method Type: Questionnaires—user provides answers to specific questions (2 methods)

UE Method                                                            UI          Effort
Questionnaire embedded within the UI (UPM)                           WIMP        IF
HTML forms-based questionnaires (e.g., WAMMI, QUIS, or NetRaker)     WIMP, Web   IF

^a The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M).

As previously discussed, conforming to the guidelines embedded in HTML analysis tools can potentially eliminate usability problems that arise due to poor HTML syntax (e.g., missing page elements) or guideline violations. However, Ratner et al. [1996] question the validity of HTML usability guidelines, since most have not been subjected to a rigorous development process, as established guidelines for WIMP interfaces have, and show little consistency among them. Brajnik [2000] surveyed 11 automated Web site analysis methods, including Bobby and LIFT Online. The survey revealed that these tools address only a sparse set of usability features, such as download time, presence of alternative text for images, and validation of HTML and links. Other usability aspects, such as consistency and information organization, are unaddressed by existing tools.

All of the Web critique tools are applicable to basic HTML pages and appear to be easy to use and learn. They also enable ongoing assessment, which can be extremely beneficial after making changes.

6. AUTOMATING INQUIRY METHODS

Similar to usability testing approaches, inquiry methods require feedback from users and are often employed during usability testing. However, the focus is not on studying specific tasks or measuring performance. Rather, the goal of these methods is to gather subjective impressions (i.e., preferences or opinions) about various aspects of a UI. Evaluators also employ inquiry methods, such as surveys, questionnaires, and interviews, to gather supplementary data after a system is released; this is useful for improving the interface for future releases. In addition, evaluators use inquiry methods for needs assessment early in the design process.

Inquiry methods vary based on whether the evaluator interacts with a user or a group of users, or whether users report their experiences using questionnaires or usage logs, possibly in conjunction with screen snapshots. Automation has been used predominately to capture subjective impressions during formal or informal interface use.

6.1. Automating Inquiry Methods: Capture Support

Table VIII provides a synopsis of capture methods developed to assist users with completing questionnaires. Software tools enable the evaluator to collect subjective usability data and possibly make improvements throughout the life of an interface. Questionnaires can be embedded into a WIMP UI to facilitate the response capture process. Typically, dialog boxes prompt users for subjective input and process responses (e.g., saving data to a file or emailing data to the evaluator). For example, UPM (the user partnering module) [Abelow 1993] uses event-driven triggers (e.g., errors or specific command invocations) to ask users specific questions about their interface usage. This approach allows the evaluator to capture user reactions while they are still fresh.
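The event-driven triggering idea can be sketched in a few lines; the trigger events, question text, and response log format below are hypothetical stand-ins for UPM's dialog-box mechanism.

```python
# A minimal sketch of event-driven questionnaire capture: when a monitored
# event fires (here, an error), prompt the user and append the response to a
# log file. Event names, question text, and file format are hypothetical.
import json, time

TRIGGERS = {"error": "You just encountered an error. What were you trying to do?"}

def on_event(event, responses_file="responses.jsonl"):
    question = TRIGGERS.get(event)
    if question is None:
        return
    answer = input(question + " ")          # a real tool would use a dialog box
    with open(responses_file, "a") as f:
        f.write(json.dumps({"time": time.time(), "event": event,
                            "answer": answer}) + "\n")

on_event("error")
```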

The Web inherently facilitates capture of questionnaire data using forms.


Users are typically presented with an HTML page for entering data, and a program on the Web server (e.g., a CGI script) processes responses. Several validated questionnaires are available in Web format, including QUIS (questionnaire for user interaction satisfaction) [Harper and Norman 1993] for WIMP interfaces and WAMMI (Website analysis and measurement inventory) [Kirakowski and Claridge 1998] for Web interfaces. NetRaker's [2000] usability research tools enable evaluators to create custom HTML questionnaires and usability tests via a template interface and to view a graphical summary of results even while studies are in progress. NetRaker's tools include the NetRaker Index (a short usability questionnaire) for continuously gathering feedback from users about a Web site. Chak [2000] reports that NetRaker's tools are highly effective for gathering direct user feedback, but points out the need to address potential irritations caused by the NetRaker Index's pop-up survey window.

As previously discussed, automated capture methods represent an important first step toward informing UI improvements. Automation support for inquiry methods makes it possible to collect data quickly from a larger number of users than is typically possible without automation. However, these methods suffer from the same limitation as nonautomated approaches—they may not clearly indicate usability problems due to the subjective nature of user responses. Furthermore, they do not support automated analysis or critique of interfaces. The real value of these techniques is that they are easy to use and widely applicable.

7. AUTOMATING ANALYTICAL MODELING METHODS

Analytical modeling complements traditional evaluation techniques such as usability testing. Given some representation or model of the UI and/or the user, these methods enable the evaluator to inexpensively predict usability. A wide range of modeling techniques has been developed and supports different types of analyses. de Haan et al. [1992] classify modeling approaches into the following four categories.

—Models for task environment analysis: enable the evaluator to assess the mapping between the user's goals and UI tasks (i.e., how the user accomplishes these goals within the UI). ETIT (external internal task mapping) [Moran 1983] is one example for evaluating the functionality, learnability, and consistency of the UI;

—Models to analyze user knowledge: enable the evaluator to use formal grammars to represent and assess knowledge required for interface use. AL (action language) [Reisner 1984] and TAG (task-action grammar) [Payne and Green 1986] allow the evaluator to compare alternative designs and predict differences in learnability;

—Models of user performance: enable the evaluator to predict user behavior, mainly task completion time. Three methods discussed in this section, GOMS analysis [John and Kieras 1996], CTA [May and Barnard 1994], and PUM [Young et al. 1989], support performance prediction; and

—Models of the user interface: enable the evaluator to represent the UI design at multiple levels of abstraction (e.g., syntactic and semantic levels) and assess this representation. CLG (command language grammar) [Moran 1981] and ETAG (extended task-action grammar) [Tauber 1990] are two methods for representing and inspecting designs.

Models that focus on user performance, such as GOMS analysis, typically support quantitative analysis. The other approaches typically entail qualitative analysis and in some cases, such as TAG, support quantitative analysis as well. Our survey only revealed automation support for methods that focus on user performance, including GOMS analysis, CTA, and PUM; this is most likely because performance prediction methods support quantitative analysis, which is easier to automate.


Table IX. Synopsis of Automated Analysis Support for Analytical Modeling Methods^a

Method Class: Analytical Modeling
Automation Type: Analysis

Method Type: UIDE Analysis—conduct GOMS analysis within a UIDE (4 methods)
UE Method                                                                      UI     Effort
Generate predictions for GOMS task models (QGOMS, CATHCI)                      WIMP   M
Generate GOMS task models and predictions (USAGE, CRITIQUE)                    WIMP   M

Method Type: Cognitive Task Analysis—predict usability problems (1 method)
UE Method                                                                      UI     Effort
Conduct a cognitive analysis of an interface and generate predictions (CTA)    WIMP   M

Method Type: Programmable User Models—write program that acts as a user (1 method)
UE Method                                                                      UI     Effort
Program architecture to mimic user interaction with an interface (PUM)         WIMP   M

^a The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M).

Automation has been predominately used to analyze task completion (e.g., execution and learning time) within WIMP UIs. Analytical modeling inherently supports automated analysis. Our survey did not reveal analytical modeling techniques to support automated critique. Most analytical modeling and simulation approaches are based on the model human processor (MHP) proposed by Card et al. [1993]. GOMS analysis (goals, operators, methods, and selection rules) is one of the most widely accepted analytical modeling methods based on the MHP [John and Kieras 1996]. Other methods based on the MHP employ simulation and are discussed in Section 8.

7.1. Automating Analytical Modeling Methods: Analysis Support

Table IX provides a synopsis of automated analysis methods discussed in the remainder of this section. The survey did not reveal analytical modeling methods for evaluating Web UIs.

7.1.1. Automating Analytical Modeling Methods: Analysis Support—WIMP UIs. The GOMS family of analytical methods uses a task structure consisting of goals, operators, methods, and selection rules. Using this task structure along with validated time parameters for each operator, the methods enable predictions of task execution and learning times, typically for error-free expert performance. The four approaches in this family include the original GOMS method proposed by Card et al. [1993] (CMN-GOMS), the simpler keystroke-level model (KLM), the natural GOMS language (NGOMSL), and the critical path method (CPM-GOMS) [John and Kieras 1996]. These approaches differ in the task granularity modeled (e.g., keystrokes vs. a high-level procedure), in the support for alternative task completion methods, and in the support for single goals versus multiple simultaneous goals.

Two of the major roadblocks to using GOMS have been the tedious task analysis and the need to calculate execution and learning times [Baumeister et al. 2000; Byrne et al. 1994; Hudson et al. 1999; Kieras et al. 1995]. These were originally specified and calculated manually or with generic tools such as spreadsheets. In some cases, evaluators implemented GOMS models in computational cognitive architectures, such as Soar or EPIC (discussed in Section 8); this approach actually added complexity and time to the analysis [Baumeister et al. 2000]. QGOMS (quick and dirty GOMS) [Beard et al. 1996] and CATHCI (cognitive analysis tool for human computer interfaces) [Williams 1993] provide support for generating quantitative predictions, but still require the evaluator to construct GOMS models. Baumeister et al. [2000] studied these approaches and showed them to be inadequate for GOMS analysis.
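At the simplest (keystroke) level, the calculation these tools automate reduces to summing operator times over a method, as in the sketch below. The operator times are the commonly cited approximate KLM values, the example method encoding is hypothetical, and real tools also handle method structure, selection rules, and learning-time predictions.

```python
# A minimal keystroke-level model (KLM) calculator: sums approximate operator
# times over an encoded method. Values are the commonly cited KLM estimates.
KLM_TIMES = {
    "K": 0.28,   # press a key (average typist)
    "P": 1.10,   # point with the mouse
    "B": 0.10,   # press or release a mouse button
    "H": 0.40,   # home hands between keyboard and mouse
    "M": 1.35,   # mental preparation
}

def execution_time(operators):
    return sum(KLM_TIMES[op] for op in operators)

# e.g., a hypothetical "delete a file via a context menu" method:
method = ["M", "P", "B", "B", "M", "P", "B", "B", "K"]
print(f"predicted execution time: {execution_time(method):.2f} s")
```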


USAGE (the UIDE System for semiautomated GOMS evaluation) [Byrne et al. 1994] and CRITIQUE (the convenient, rapid, interactive tool for integrating quick usability evaluations) [Hudson et al. 1999] provide support for automatically generating a GOMS task model and quantitative predictions for the model. (USAGE is not to be confused with the UsAGE log file capture and analysis tool discussed in Section 4.) Both of these tools accomplish this within a user interface development environment (UIDE). GLEAN (GOMS language evaluation and analysis) [Kieras et al. 1995] is another tool that generates quantitative predictions for a given GOMS task model (discussed in more detail in Section 8). These tools reduce the effort required to employ GOMS analysis and generate predictions that are consistent with models produced by experts. The major hindrance to wide application of these tools is that they operate on limited platforms (e.g., Sun machines), model low-level goals (e.g., at the keystroke level for CRITIQUE), do not support multiple task completion methods (even though GOMS was designed to support this), and rely on an idealized expert user model.

Cognitive task analysis (CTA) [May and Barnard 1994] uses a different modeling approach from GOMS analysis. GOMS analysis requires the evaluator to construct a model for each task to be analyzed. However, CTA requires the evaluator to input an interface description to an underlying theoretical model for analysis. The theoretical model, an expert system based on interacting cognitive subsystems (ICS, discussed in Section 8), generates predictions about performance and usability problems similarly to a cognitive walkthrough. The system prompts the evaluator for interface details from which it generates predictions and a report detailing the theoretical basis of predictions. The authors refer to this form of analysis as "supportive evaluation."

The programmable user model (PUM) [Young et al. 1989] is an entirely different analytical modeling technique. In this approach, the designer is required to write a program that acts as a user using the interface; the designer must specify explicit sequences of operations for each task. Task sequences are then analyzed by an architecture (similar to the CTA expert system) that imposes approximations of psychological constraints, such as memory limitations. Constraint violations can be seen as potential usability problems. The designer can alter the interface design to resolve violations, and ideally improve the implemented UI as well. Once the designer successfully programs the architecture (i.e., creates a design that adheres to the psychological constraints), the model can then be used to generate quantitative performance predictions similar to GOMS analysis. By making a designer aware of considerations and constraints affecting usability from the user's perspective, this approach provides clear insight into specific problems with a UI.
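A toy illustration of the PUM idea is given below: the "user program" is an explicit operation sequence, and a crude architecture applies a single psychological constraint (a working-memory limit). The capacity value, operations, and violation messages are inventions for illustration and do not reflect the actual PUM architecture.

```python
# Toy "programmable user model": execute an explicit operation sequence and
# flag violations of one approximate constraint (working-memory capacity).
WM_CAPACITY = 4   # illustrative limit on simultaneously remembered items

def run_user_program(operations):
    memory = set()
    for op, item in operations:
        if op == "remember":
            memory.add(item)
            if len(memory) > WM_CAPACITY:
                print(f"constraint violation: user must hold {len(memory)} items "
                      f"{sorted(memory)} - potential usability problem")
        elif op == "use":
            if item not in memory:
                print(f"constraint violation: '{item}' needed but not in memory")
            else:
                memory.discard(item)

# hypothetical task: the UI forces the user to remember five codes before use
program = [("remember", f"code{i}") for i in range(5)] + [("use", "code0")]
run_user_program(program)
```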

7.1.2. Automating Analytical Modeling Methods: Analysis Support—Discussion. Table IX summarizes automated analysis methods discussed in this section. Analytical modeling approaches enable the evaluator to produce relatively inexpensive results to inform design choices. GOMS has been shown to be applicable to all types of WIMP UIs and is effective at predicting usability problems. However, these predictions are limited to error-free expert performance in many cases, although early accounts of GOMS considered error correction [Card et al. 1983]. The development of USAGE and CRITIQUE has reduced the learning time and effort required to apply GOMS analysis, but they suffer from limitations previously discussed. Tools based on GOMS may also require empirical studies to determine operator parameters in cases where these parameters have not been previously validated and documented.

Although CTA is an ideal solution for iterative design, it does not appear to be a fully developed methodology. Two demonstration systems have been developed and effectively used by a group of practitioners as well as by a group of graduate students


[May and Barnard 1994]. However, some users experienced difficulty with entering system descriptions, which can be a time-consuming process. After the initial interface specification, subsequent analysis is easier because the demonstration systems store interface information. The approach appears to be applicable to all WIMP UIs. It may be possible to apply a more fully developed approach to Web UIs.

PUM is a programming approach, and thus requires considerable effort and learning time to employ. Although it appears that this technique is applicable to all WIMP UIs, its effectiveness is not discussed in detail in the literature.

Analytical modeling of Web UIs lags far behind efforts for WIMP interfaces. Many Web authoring tools, such as Microsoft FrontPage and Macromedia Dreamweaver, provide limited support for usability evaluation in the design phase (e.g., predict download time and check HTML syntax). This addresses only a small fraction of usability problems. While analytical modeling techniques are potentially beneficial, our survey did not uncover any approaches that address this gap in Web site evaluation. Approaches such as GOMS analysis will not map as well to the Web domain, because it is difficult to predict how a user will accomplish the goals in a task hierarchy given that there are many different ways to navigate a typical site. Another problem is GOMS' reliance on an expert user model (at least in the automated approaches), which does not fit the diverse user community of the Web. Hence, new analytical modeling approaches, such as a variation of CTA, are required to evaluate the usability of Web sites.

8. AUTOMATING SIMULATION METHODS

Simulation complements traditional UE methods and, like analytical modeling, can be viewed as inherently supporting automated analysis. Using models of the user and/or the interface design, these approaches simulate the user interacting with the interface and report the results of this interaction, in the form of performance measures and interface operations, for instance. Evaluators can run simulations with different parameters in order to study various UI design tradeoffs and thus make more informed decisions about UI implementation. Simulation is also used to automatically generate synthetic usage data for analysis with log file analysis techniques [Chi et al. 2000] or event playback in a UI [Kasik and George 1996]. Thus simulation can also be viewed as supporting automated capture to some degree.

8.1. Automating Simulation Methods: Capture Support

Table X provides a synopsis of the two automated capture methods discussed in this section. Kasik and George [1996] developed an automated technique for generating and capturing usage data; these data could then be used for driving tools that replay events (such as executing a log file) within Motif-based UIs. The goal of this work is to use a small number of input parameters to inexpensively generate a large number of usage traces (or test scripts) representing novice users. The evaluator can then use these traces to find weak spots, failures, and other usability problems.

To create novice usage traces, the designer initially produces a trace representing an expert using the UI; a scripting language is available to produce this trace. The designer can then insert deviation commands at different points within the expert trace. During trace execution, a genetic algorithm determines user behavior at deviation points, and in effect simulates a novice user learning by experimentation. Genetic algorithms consider past history in generating future random numbers; this enables the emulation of user learning. Altering key features of the genetic algorithm enables the designer to simulate other user models. Although currently not supported by this tool, traditional random number generation can also be employed to explore the outer limits of a


Table X. Synopsis of Automated Capture Support for Simulation Methods^a

Method Class: Simulation
Automation Type: Capture

Method Type: Genetic Algorithm Modeling—mimic novice user interaction (1 method)
UE Method                                                                             UI     Effort
Generate and capture usage traces representing novice users (Kasik and George [1996]) WIMP

Method Type: Information Scent Modeling—mimic Web site navigation (1 method)
UE Method                                                                             UI     Effort
Generate and capture navigation paths for information-seeking tasks (Chi et al. [2000]) Web   M

^a The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M).

UI, for example, by simulating completely random behavior.
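A greatly simplified version of deviating from an expert trace is sketched below. Plain random choices stand in for the genetic algorithm's history-sensitive mechanism, and the event vocabulary and deviation points are hypothetical.

```python
# Simplified generation of novice-like traces by deviating from an expert
# trace at designated points; real tools bias these choices with a genetic
# algorithm so that "learning" emerges over repeated runs.
import random

expert_trace = ["open_dialog", "type_name", "set_options", "press_ok"]
possible_events = ["press_cancel", "open_help", "click_background", "press_ok"]

def novice_trace(expert, deviation_points, max_wandering=3, seed=None):
    rng = random.Random(seed)
    trace = []
    for i, event in enumerate(expert):
        if i in deviation_points:            # wander a little before resuming
            trace += rng.choices(possible_events, k=rng.randint(1, max_wandering))
        trace.append(event)
    return trace

for n in range(3):
    print(novice_trace(expert_trace, deviation_points={1, 3}, seed=n))
```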

Chi et al. [2000] developed a similar approach for generating and capturing navigation paths for Web UIs. This approach creates a model of an existing site that embeds information about the similarity of content among pages, server log data, and linking structure. The evaluator specifies starting points in the site and information needs (i.e., specified pages) as input to the simulator. The simulation models a number of agents (i.e., hypothetical users) traversing the links and content of the site model. At each page, the model considers information "scent" (i.e., common keywords between an agent's goal and content on linked pages) in making navigation decisions. Navigation decisions are controlled probabilistically such that most agents traverse higher-scent links (i.e., closest match to information goal) and some agents traverse lower-scent links. Simulated agents stop when they reach the specified pages or after an arbitrary amount of effort (e.g., maximum number of links or browsing time). The simulator records navigation paths and reports the proportion of agents that reached specified pages.

The authors use these usage paths as input to the Dome Tree visualization methodology, an inferential log file analysis approach discussed in Section 4. The authors compare actual and simulated navigation paths for Xerox's corporate site and discover a close match when scent is "clearly visible" (meaning links are not embedded in long text passages or obstructed by images). Since the site model does not consider actual page elements, the simulator cannot account for the impact of various page aspects, such as the amount of text or reading complexity, on navigation choices. Hence, this approach may enable only crude approximations of user behavior for sites with complex pages.
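The probabilistic, scent-weighted navigation at the heart of this simulation can be sketched as follows. The site structure, keyword sets, baseline weight, and stopping rule are hypothetical simplifications of the model described by Chi et al.

```python
# Scent-weighted navigation agents over a hypothetical site: link choice is
# proportional to keyword overlap with the goal plus a small baseline weight.
import random

links = {"home": ["laptops", "support", "jobs"], "laptops": [], "support": [], "jobs": []}
keywords = {"home": {"company"}, "laptops": {"laptop", "buy", "price"},
            "support": {"help", "repair"}, "jobs": {"careers"}}

def simulate(goal, start="home", target="laptops", agents=1000, max_steps=5):
    reached = 0
    for _ in range(agents):
        page = start
        for _ in range(max_steps):
            if page == target:
                reached += 1
                break
            if not links[page]:
                break              # dead end: agent gives up
            scent = [len(goal & keywords[nxt]) + 0.1 for nxt in links[page]]
            page = random.choices(links[page], weights=scent)[0]
    return reached / agents

print("proportion of agents reaching the target:", simulate({"buy", "laptop"}))
```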

8.1.1. Automating Simulation Methods: Capture Support—Discussion. Table X summarizes automated capture methods discussed in this section. Without these techniques, the evaluator must anticipate all possible usage scenarios or rely on formal or informal interface use to generate usage traces. Formal and informal use limit UI coverage to a small number of tasks or to UI features that are employed in regular use. Automated techniques, such as the genetic algorithm approach, enable the evaluator to produce a larger number of usage scenarios and widen UI coverage with minimal effort.

The system developed by Kasik and George [1996] appears to be relatively straightforward to use, since it interacts directly with a running application and does not require modeling. Interaction with the running application also ensures that generated usage traces are plausible. Experiments demonstrated that it is possible to generate a large number of usage traces within an hour. However, an


Table XI. Synopsis of Automated Analysis Support for Simulation Methods^a

Method Class: Simulation
Automation Type: Analysis

Method Type: Petri Net Modeling—mimic user interaction from usage data (1 method)
UE Method                                                                  UI     Effort
Construct Petri nets from usage data and analyze problem solving (AMME)    WIMP   IF

Method Type: Information Processor Modeling—mimic user interaction (9 methods)
UE Method                                                                  UI     Effort
Employ a computational cognitive architecture for UI analysis (ACT-R, COGNET, EPIC, HOS, Soar, CCT, ICS, GLEAN)   WIMP   M
Employ a GOMS-like model to analyze navigation (Site Profile)              Web    M

^a The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M).

evaluator must manually analyze the execution of each trace in order to identify problems. The authors propose future work to automatically verify that a trace produces the correct result. The evaluator must also program an expert user trace, which could make the system difficult to use and learn. Currently, this tool is only applicable to Motif-based UIs.

The approach developed by Chi et al. [2000] is applicable to all Web UIs. It also appears to be straightforward to use and learn, since software produces the Web site model automatically. The evaluator must manually interpret simulation results; however, analysis could be facilitated with the Dome Tree visualization tool.

8.2. Automating Simulation Methods: Analysis Support

Table XI provides a synopsis of the automated analysis methods discussed in the remainder of this section. We consider methods for WIMP and Web UIs separately.

8.2.1. Automating Simulation Methods: Analysis Support—WIMP UIs. AMME [Rauterberg and Aeppili 1995] (see Section 4.1) is the only surveyed approach that constructs a WIMP simulation model (Petri net) directly from usage data. Other methods are based on a model similar to the MHP and require the evaluator to conduct a task analysis (and subsequently validate it with empirical data) in order to develop a simulator. Hence, AMME is more accurate and flexible (i.e., task and user independent), and simulates more detail (e.g., error performance and preferred task sequences). AMME simulates learning, user decisions, and task completion and outputs a measure of behavior complexity. The behavior complexity measure has been shown to correlate negatively with learning and interface complexity. Studies have also validated the accuracy of generated models with usage data [Rauterberg 1995]. AMME should be applicable to Web interfaces as well, since it constructs models from log files. Despite its advantages, AMME still requires formal interface use to generate log files for simulation studies.

The remaining WIMP simulation methods are based on sophisticated computational cognitive architectures, theoretical models of user behavior similar to the MHP previously discussed. Unlike analytical modeling approaches, these methods attempt to approximate user behavior as accurately as possible. For example, the simulator may track the user's memory contents, interface state, and the user's hand movements during execution. This enables the simulator to report a detailed trace of the simulation run. Some simulation methods, such as CCT [Kieras and Polson 1985] (discussed below), can also generate predictions statically (i.e.,


without being executed) similarly to analytical modeling methods.

Pew and Mavor [1998] provide a detailed discussion of computational cognitive architectures and an overview of many approaches, including five that we discuss: ACT-R (adaptive control of thought) [Anderson 1990, 1993], COGNET (cognition as a network of tasks) [Zachary et al. 1996], EPIC (executive-process interactive control) [Kieras et al. 1997], HOS (human operator simulator) [Glenn et al. 1992], and Soar [Laird and Rosenbloom 1996; Polk and Rosenbloom 1994]. Here, we also consider CCT (cognitive complexity theory) [Kieras and Polson 1985], ICS (interacting cognitive subsystems) [Barnard 1987; Barnard and Teasdale 1991], and GLEAN (GOMS language evaluation and analysis) [Kieras et al. 1995]. Rather than describe each method individually, we summarize the major characteristics of these simulation methods in Table XII and discuss them below.

Modeled Tasks. The models we surveyed simulate the following three types of tasks: a user performing cognitive tasks (e.g., problem-solving and learning: COGNET, ACT-R, Soar, ICS); a user immersed in a human-machine system (e.g., an aircraft or tank: HOS); and a user interacting with a typical UI (EPIC, GLEAN, CCT).

Modeled Components. Some simulations focus solely on cognitive processing (ACT-R, COGNET) whereas others incorporate perceptual and motor processing as well (EPIC, ICS, HOS, Soar, GLEAN, CCT).

Component Processing. Task execution is modeled as serial processing (ACT-R, GLEAN, CCT), parallel processing (EPIC, ICS, Soar), or semiparallel processing (serial processing with rapid attention switching among the modeled components, giving the appearance of parallel processing: COGNET, HOS).

Model Representation. To represent the underlying user or system, simulation methods use task hierarchies (as in a GOMS task structure: HOS, CCT), production rules (CCT, ACT-R, EPIC, Soar, ICS), or declarative/procedural programs (GLEAN, COGNET). CCT uses both a task hierarchy and production rules to represent the user and system models, respectively.

Predictions. The surveyed methods return a number of simulation results, including predictions of task performance (EPIC, CCT, COGNET, GLEAN, HOS, Soar, ACT-R), memory load (ICS, CCT), learning (ACT-R, Soar, ICS, GLEAN, CCT), or behavior predictions such as action traces (ACT-R, COGNET, EPIC).

These methods vary widely in their ability to illustrate usability problems. Their effectiveness is largely determined by the characteristics discussed (modeled tasks, modeled components, component processing, model representation, and predictions). Methods that are potentially the most effective at illustrating usability problems model UI interaction, process all components (perception, cognition, and motor) in parallel, employ production rules, and report on task performance, memory load, learning, and simulated user behavior. Such methods would enable the most flexibility and the closest approximation of actual user behavior. The use of production rules is important in this methodology because it relaxes the requirement for an explicit task hierarchy, thus allowing for the modeling of more dynamic behavior, such as Web site navigation.

EPIC is the only simulation analysis method that embodies most of these ideal characteristics. It employs production rules and models UI interaction and all components (perception, cognition, and motor) processing in parallel. It reports task performance and simulated user behavior, but does not report memory load and learning estimates. Studies with EPIC demonstrated that predictions for telephone operator and menu searching tasks closely match observed data. EPIC and all of the other methods require considerable learning time and effort to employ. They are also applicable to a wide range of WIMP UIs.

Table XII. Characteristics of WIMP Simulation Methods Based on a Variation of the MHP

Parameter | UE Methods
Modeled Tasks
  problem-solving and/or learning | COGNET, ACT-R, Soar, ICS
  human-machine system | HOS
  UI interaction | EPIC, GLEAN, CCT
Modeled Components
  cognition | ACT-R, COGNET
  perception, cognition, & motor | EPIC, ICS, HOS, Soar, GLEAN, CCT
Component Processing
  serial | ACT-R, GLEAN, CCT
  semiparallel | COGNET, HOS
  parallel | EPIC, ICS, Soar
Model Representation
  task hierarchy | HOS, CCT
  production rules | CCT, ACT-R, EPIC, Soar, ICS
  program | GLEAN, COGNET
Predictions
  task performance | EPIC, CCT, COGNET, GLEAN, HOS, Soar, ACT-R
  memory load | ICS, CCT
  learning | ACT-R, Soar, ICS, GLEAN, CCT
  behavior | ACT-R, COGNET, EPIC

8.2.2. Automating Simulation Methods: Analysis Support—Web UIs

Our survey revealed only one simulation approach for analysis of Web interfaces, WebCriteria's Site Profile [Web Criteria 1999]. Unlike the other simulation approaches, it requires an implemented interface for evaluation. Site Profile performs analysis in four phases: gather, model, analyze, and report. During the gather phase, a spider traverses a site (200 to 600 unique pages) to collect Web site data. These data are then used to construct a nodes-and-links model of the site. For the analysis phase, it uses an idealistic Web user model (called Max [Lynch et al. 1999]) to simulate a user's information-seeking behavior; this model is based on prior research with GOMS analysis. Given a starting point in the site, a path, and a target, Max "follows" the path from the starting point to the target and logs measurement data. These measurements are used to compute an accessibility metric, which is then used to generate a report. This approach can be used to compare Web sites, provided that an appropriate navigation path is supplied for each.
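The gather-model-analyze pipeline can be approximated with a few lines of code: crawl results become a nodes-and-links graph, and navigation time is estimated from the shortest path between a start and a target page. The sketch below assumes a fixed per-page cost as a stand-in for a calibrated user model such as Max; it is not WebCriteria's implementation.

from collections import deque, defaultdict

def shortest_path(links, start, target):
    """Breadth-first search over a nodes-and-links site model."""
    adj = defaultdict(list)
    for src, dst in links:
        adj[src].append(dst)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in adj[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def accessibility_estimate(links, start, target, seconds_per_page=5.0):
    """Navigation time ~ clicks on the shortest path times an assumed
    per-page cost; a stand-in for a calibrated user model, not Max itself."""
    path = shortest_path(links, start, target)
    return (len(path) - 1) * seconds_per_page, path

# Hypothetical site structure gathered by a crawler
links = [("home", "products"), ("home", "about"),
         ("products", "widget"), ("widget", "order")]
print(accessibility_estimate(links, "home", "order"))
# -> (15.0, ['home', 'products', 'widget', 'order'])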

The usefulness of this approach is questionable, since currently it only computes accessibility (navigation time) for the shortest path between specified start and destination pages using a single user model. Other measurements, such as freshness and page composition, also have questionable value in improving the Web site. Brajnik [2000] showed Site Profile to support only a small fraction of the analysis supported by guideline review methods, such as WebSAT and Bobby (discussed in Section 5). Chak [2000] also cautions that the accessibility measure should be used as an initial benchmark, not a highly accurate approximation. Site Profile does not entail any learning time or effort on the part of the evaluator, since WebCriteria performs the analysis. The method is applicable to all Web UIs.

8.2.3. Automating Simulation Methods: Analysis Support—Discussion

Table XI summarizes the automated analysis methods discussed in this section. Unlike most evaluation approaches, simulation can be employed prior to UI implementation in most cases (although AMME and WebCriteria's Site Profile are exceptions to this). Hence, simulation enables alternative designs to be compared and optimized before implementation.

Table XIII. Synopsis of Automated Analysis Support for Usability Testing Methods^a

Method Class: Usability Testing
Automation Type: Analysis
Method Type: Log File Analysis—analyze usage data (19 methods)

UE Method | UI | Effort
Use metrics during log file analysis (DRUM, MIKE, UIMS, AMME) | WIMP | IF
Use metrics during log file analysis (Service Metrics, Bacheldor [1999]) | Web | IF
Use pattern-matching during log file analysis (MRP) | WIMP | IF
Use task models during log file analysis (IBOT, QUIP, KALDI, UsAGE) | WIMP | IF
Use task models and pattern-matching during log file analysis (EMA, USINE, RemUSINE) | WIMP | IFM
Visualization of log files (Guzdial et al. [1994]) | WIMP | IF
Statistical analysis or visualization of log files (traffic- and time-based analyses, VISVIP, Starfield and Dome Tree visualizations) | Web | IF

^a The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M). This is a repetition of Table IV.

It is difficult to assess the effectiveness of simulation methods, although there have been reports that show EPIC [Kieras et al. 1997] and GLEAN [Baumeister et al. 2000] to be effective. AMME appears to be the most effective method, since it is based on actual usage. AMME also enables ongoing assessment and could be widely used for WIMP and Web interfaces, provided log files and system models are available. EPIC is the only method based on the MHP that embodies the ideal simulator characteristics previously discussed. GLEAN is actually based on EPIC, so it has similar properties.

In general, simulation methods are more difficult to use and learn than other evaluation methods, because they require constructing or manipulating complex models as well as understanding the theory behind a simulation approach. Approaches based on the MHP are widely applicable to all WIMP UIs. Approaches that use production rules, such as EPIC, CCT, and Soar, could possibly be applied to Web UIs, where task sequences are not as clearly defined as in WIMP UIs. Soar has actually been adapted to model browsing tasks similar to Web browsing [Peck and John 1992].

9. EXPANDING EXISTING APPROACHES TO AUTOMATING USABILITY EVALUATION METHODS

Automated usability evaluation methods have many potential benefits, including reducing the costs of nonautomated methods, aiding in comparisons between alternative designs, and improving consistency in evaluation results. We have studied numerous methods that support automation. Based on the methods surveyed, it is our opinion that research to further develop log file analysis, guideline review, analytical modeling, and simulation techniques could result in several promising automated techniques, as discussed in more detail below. Ivory [2001] provides a more detailed discussion of techniques that merit further research, including approaches based on performance evaluation of computer systems.

9.1. Expanding Log File Analysis Approaches

Our survey showed log file analysis to be a viable methodology for automated analysis of usage data. Table XIII summarizes current approaches to log file analysis. We propose the following three ways to expand and improve on these approaches.

Generating synthetic usage data for analysis. The main limitation of log file analysis is that it still requires formal or informal interface use to employ. One way to expand the use and benefits of this methodology is to leverage a small amount of test data to generate a larger set of plausible usage data. This is even more important for Web interfaces, since server logs do not capture a complete record of user interactions. We discussed two simulation approaches, one using a genetic algorithm [Kasik and George 1996] and the other using information scent modeling [Chi et al. 2000] (see Section 8.1), that automatically generate plausible usage data. The genetic algorithm approach determines user behavior during deviation points in an expert user script, and the information scent model selects navigation paths by considering word overlap between links and Web pages. Both of these approaches generate plausible usage traces without formal or informal interface use. These techniques also provide valuable insight on how to leverage real usage data from usability tests or informal use. For example, real data could also serve as input scripts for genetic algorithms; the evaluator could add deviation points to these.
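A minimal version of the information-scent idea is sketched below: at each page, a simulated user follows the link whose text shares the most words with a goal description, yielding one plausible usage trace without any real users. The site structure, goal, and scoring function are hypothetical simplifications of the word-overlap approach, not Chi et al.'s actual system.

def word_overlap(a, b):
    """Crude 'scent' score: shared words between a goal description and link text."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def generate_trace(site, start, goal_text, max_steps=10):
    """Follow, at each page, the outgoing link with the highest word overlap
    with the goal description, producing one plausible usage trace."""
    page, trace = start, [start]
    for _ in range(max_steps):
        links = site.get(page, {})          # {link text: destination page}
        if not links:
            break
        best = max(links, key=lambda text: word_overlap(text, goal_text))
        if word_overlap(best, goal_text) == 0:
            break                           # no scent left; simulated user stops
        page = links[best]
        trace.append(page)
    return trace

# Hypothetical site: page -> {link text: destination}
site = {
    "home":    {"product catalog": "catalog", "company history": "about"},
    "catalog": {"order a widget": "order", "widget specifications": "specs"},
}
print(generate_trace(site, "home", "order a widget from the catalog"))
# -> ['home', 'catalog', 'order']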

Using log files for comparing UIs. Real and simulated usage data could also be used to evaluate comparable WIMP UIs, such as word processors and image editors. Task sequences could comprise a usability benchmark (i.e., a program for measuring UI performance); this is similar to GOMS analysis of comparable task models. After mapping task sequences into specific UI operations in each interface, the benchmark could be executed within each UI to collect measurements. Representing this benchmark as a log file of some form would enable the log file to be executed within a UI by replay tools, such as QC/Replay [Centerline 1999] for X-Windows, UsAGE [Uehling and Wolf 1995] for replaying events within a UIMS (discussed in Section 4), or WinRunner [Mercury Interactive 2000] for a wide range of applications (e.g., Java and Oracle applications). This is a promising open area of research for evaluating comparable WIMP UIs.
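The benchmark idea might be represented as sketched below: an abstract task sequence is mapped onto interface-specific operations and then replayed to collect timings. The task steps, operation names, and timing stub are hypothetical; a real setup would drive one of the replay tools mentioned above.

# Sketch of a UI-independent benchmark: abstract task steps are mapped to
# concrete operations for each interface, then "executed" to collect timings.
# Operation names and timings here are hypothetical placeholders.

BENCHMARK = ["open-document", "insert-image", "save-document"]

# Per-interface mapping from abstract steps to concrete UI operations
MAPPINGS = {
    "editor_a": {"open-document": ["File>Open", "pick file"],
                 "insert-image":  ["Insert>Image", "pick file"],
                 "save-document": ["File>Save"]},
    "editor_b": {"open-document": ["Ctrl+O", "pick file"],
                 "insert-image":  ["drag-and-drop image"],
                 "save-document": ["Ctrl+S"]},
}

def replay(ui_name, execute_op):
    """Replay the benchmark against one UI; execute_op would drive a replay
    tool and return the observed duration for a single operation."""
    total = 0.0
    for step in BENCHMARK:
        for op in MAPPINGS[ui_name][step]:
            total += execute_op(ui_name, op)
    return total

# Stand-in for a real replay tool: pretend every operation takes one second.
fake_execute = lambda ui, op: 1.0
for ui in MAPPINGS:
    print(ui, replay(ui, fake_execute), "s")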

Augmenting task-based pattern-matching approaches with guidelines in order to support automated critique. Given a wider sampling of usage data, using task models and pattern-matching during log file analysis is a promising research area to pursue. Task-based approaches that follow the USINE model in particular (i.e., compare a task model expressed in terms of temporal relationships to usage traces) provide the most support among the methods surveyed. USINE outputs information to help the evaluator understand user behavior, preferences, and errors. Although the authors claim that this approach works well for WIMP UIs, it needs to be adapted to work for Web UIs, where tasks may not be clearly defined. In addition, since USINE already reports substantial analysis data, these data could be compared to usability guidelines in order to support automated critique.
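The comparison step at the heart of such task-based analysis can be illustrated with a drastically simplified task model, here reduced to "A must precede B" constraints checked against a usage trace. Real task models (e.g., ConcurTaskTrees, as used by USINE) express much richer temporal relationships; the constraints and trace below are hypothetical.

def check_trace(trace, precedences):
    """Flag violations of simple 'A must precede B' constraints in a usage
    trace. This only illustrates the comparison step, not USINE itself."""
    problems = []
    for before, after in precedences:
        try:
            i, j = trace.index(before), trace.index(after)
        except ValueError:
            continue                       # one of the actions never occurred
        if i > j:
            problems.append(f"'{after}' performed before required '{before}'")
    return problems

# Hypothetical task model constraints and a logged trace
precedences = [("select recipient", "send message"),
               ("attach file", "send message")]
trace = ["open mailer", "send message", "select recipient"]
print(check_trace(trace, precedences))
# -> ["'send message' performed before required 'select recipient'"]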

9.2. Expanding Guideline Review Approaches

Several guideline review methods for analysis of WIMP interfaces (see Table XIV) could be augmented with guidelines to support automated critique. For example, AIDE [Sears 1995] (discussed in Section 5) provides the most support for evaluating UI designs. It computes a number of quantitative measures and also generates initial interface layouts. Guidelines, such as thresholds for quantitative measures, could also be incorporated into AIDE analysis to support automated critique.

Although there are several guideline review methods for analyzing and critiquing Web UIs (see Tables XIV and XV), existing approaches only cover a small fraction of usability aspects [Brajnik 2000] and have not been empirically validated. We are developing a methodology for deriving Web design guidelines directly from sites that have been assessed by human judges [Ivory et al. 2000, 2001; Ivory 2001]. We have identified a number of Web page measures having to do with page composition (e.g., number of words, links, and images), page formatting (e.g., emphasized text, text positioning, and text clusters), and overall page characteristics (e.g., page size and download speed). Empirical studies have shown that we can predict with high accuracy whether a Web page will be rated favorably based on key metrics. Future work will identify profiles of highly rated pages that can be used to help designers improve Web UIs.

Table XIV. Synopsis of Automated Analysis Support for Inspection Methods^a

Method Class: Inspection
Automation Type: Analysis
Method Type: Guideline Review—expert checks guideline conformance (8 methods)

UE Method | UI | Effort
Use quantitative screen measures for analysis (AIDE, Parush et al. [1998]) | WIMP |
Analyze terminology and consistency of UI elements (Sherlock) | WIMP |
Analyze the structure of Web pages (Rating Game, HyperAT, Gentler) | Web |
Use guidelines for analysis (WebSAT) | Web |
Analyze the scanning path of a Web page (Design Advisor) | Web |

^a The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M). This is a repetition of Table VI.

Table XV. Synopsis of Automated Critique Support for Inspection Methods^a

Method Class: Inspection
Automation Type: Critique
Method Type: Guideline Review—expert checks guideline conformance (11 methods)

UE Method | UI | Effort
Use guidelines for critiquing (KRI/AG, IDA, CHIMES, Ergoval) | WIMP |
Use guidelines for critiquing and modifying a UI (SYNOP) | WIMP | M
Check HTML syntax (Weblint, Dr. Watson) | Web |
Use guidelines for critiquing (Lift Online, Lift Onsite, Bobby, WebEval) | Web |

^a The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M). This is a repetition of Table VII.
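As a rough illustration of how page-composition measures like those discussed above could feed an automated critique, the sketch below counts words, links, and images in a page and flags values outside illustrative ranges. The thresholds are placeholders, not the empirically derived profiles from our studies.

import re
from html.parser import HTMLParser

class PageMetrics(HTMLParser):
    """Collect a few simple page-composition measures (word, link, image counts)."""
    def __init__(self):
        super().__init__()
        self.links = self.images = self.words = 0
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1
        elif tag == "img":
            self.images += 1
    def handle_data(self, data):
        self.words += len(re.findall(r"\w+", data))

# Illustrative thresholds only; real guidelines would come from empirical profiles.
THRESHOLDS = {"words": (50, 1000), "links": (5, 100), "images": (0, 30)}

def critique(html):
    p = PageMetrics()
    p.feed(html)
    measures = {"words": p.words, "links": p.links, "images": p.images}
    return [f"{name} = {value} outside suggested range {THRESHOLDS[name]}"
            for name, value in measures.items()
            if not THRESHOLDS[name][0] <= value <= THRESHOLDS[name][1]]

print(critique("<html><body><a href='/'>home</a><p>tiny page</p></body></html>"))
# flags the low word and link counts of this tiny page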

9.3. Expanding Analytical Modeling Approaches

Our survey showed that evaluation within a user interface development environment (UIDE) is a promising approach for automated analysis via analytical modeling. Table XVI summarizes current approaches to analytical modeling. We propose to augment UIDE analysis methods, such as CRITIQUE and GLEAN, with guidelines to support automated critique. Guidelines, such as thresholds for learning or executing certain types of tasks, could assist the designer with interpreting prediction results and improving UI designs. Evaluation within a UIDE should also make it possible to automatically optimize UI designs based on guidelines.
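One possible shape for such guideline-based critique of model predictions is sketched below: predicted execution and learning times are checked against threshold guidelines and turned into critique messages. The thresholds, measures, and task names are hypothetical placeholders.

# Illustrative only: turn model predictions (e.g., from a GOMS-style analysis)
# into critiques by checking them against guideline thresholds.

GUIDELINES = {"execution_time_s": 10.0, "learning_time_min": 30.0}

def critique_predictions(predictions):
    notes = []
    for task, values in predictions.items():
        for measure, limit in GUIDELINES.items():
            if values.get(measure, 0) > limit:
                notes.append(f"{task}: predicted {measure} {values[measure]} "
                             f"exceeds guideline {limit}")
    return notes

predictions = {"create new invoice": {"execution_time_s": 14.2, "learning_time_min": 12.0},
               "print invoice":      {"execution_time_s": 3.1,  "learning_time_min": 2.0}}
print(critique_predictions(predictions))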

Although UIDE analysis is promising, it is not widely used in practice. This may be due to the fact that most tools are research systems and have not been incorporated into popular commercial tools. This is unfortunate, since incorporating analytical modeling and possibly simulation methods within a UIDE should mitigate some barriers to their use, such as being too complex and time consuming to employ [Bellotti 1988]. Applying such analysis approaches outside these user interface development environments is an open research problem.

Cognitive task analysis [May and Barnard 1994] provides some insight for analyzing UIs outside a UIDE. Furthermore, CTA is a promising approach for automated analysis, provided more effort is spent to fully develop this methodology. This approach is consistent with analytical modeling techniques employed outside HCI, such as in the performance evaluation of computer systems [Ivory 2001; Jain 1991]; this is because with CTA the evaluator provides UI parameters to an underlying model for analysis versus developing a new model to assess each UI. However, one of the drawbacks of CTA is the need to describe the interface to the system. Integrating this approach into a UIDE or UIMS should make this approach more tenable.

Table XVI. Synopsis of Automated Analysis Support for Analytical Modeling Methods^a

Method Class: Analytical Modeling
Automation Type: Analysis
Method Type: UIDE Analysis—conduct GOMS analysis within a UIDE (4 methods)

UE Method | UI | Effort
Generate predictions for GOMS task models (QGOMS, CATHCI) | WIMP | M
Generate GOMS task models and predictions (USAGE, CRITIQUE) | WIMP | M

Method Type: Cognitive Task Analysis—predict usability problems (1 method)

UE Method | UI | Effort
Conduct a cognitive analysis of an interface and generate predictions (CTA) | WIMP | M

Method Type: Programmable User Models—write program that acts as a user (1 method)

UE Method | UI | Effort
Program architecture to mimic user interaction with an interface (PUM) | WIMP | M

^a The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M). This is a repetition of Table IX.

Table XVII. Synopsis of Automated Analysis Support for Simulation Methods^a

Method Class: Simulation
Automation Type: Analysis
Method Type: Petri Net Modeling—mimic user interaction from usage data (1 method)

UE Method | UI | Effort
Construct Petri nets from usage data and analyze problem solving (AMME) | WIMP | IF

Method Type: Information Processor Modeling—mimic user interaction (9 methods)

UE Method | UI | Effort
Employ a computational cognitive architecture for UI analysis (ACT-R, COGNET, EPIC, HOS, Soar, CCT, ICS, GLEAN) | WIMP | M
Employ a GOMS-like model to analyze navigation (Site Profile) | Web | M

^a The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M). This is a repetition of Table XI.

As previously discussed, analytical modeling approaches for Web UIs still remain to be developed. It may not be possible to develop new approaches using a paradigm that requires explicit task hierarchies. However, a variation of CTA may be appropriate for Web UIs.

9.4. Expanding Simulation Approaches

Table XVII summarizes current approaches to simulation analysis. Our survey showed that existing simulations based on a human information processor model have widely different uses (e.g., modeling a user interacting with a UI or solving a problem). Thus, it is difficult to draw concrete conclusions about the effectiveness of these approaches. Simulation in general is a promising research area to pursue for automated analysis, especially for evaluating alternative designs.

We propose using several simulation techniques employed in the performance analysis of computer systems, in particular, trace-driven discrete-event simulation and Monte Carlo simulation [Ivory 2001; Jain 1991], to enable designers to perform what-if analyses with UIs. Trace-driven discrete-event simulations employ real usage data to model a system as it evolves over time. Analysts use this approach to simulate many aspects of computer systems, such as the processing subsystem, operating system, and various resource scheduling algorithms. In the user interface field, all surveyed approaches use discrete-event simulation. However, AMME constructs simulation models directly from logged usage, which is a form of trace-driven discrete-event simulation. Similarly, other simulators could be altered to process log files as input instead of explicit task or user models, potentially producing more realistic and accurate simulations.
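A trace-driven discrete-event simulation in its simplest form replays logged arrival times against a model of the system; the sketch below serves logged user actions with a single simulated resource and reports waiting times. The single-server model and service time are assumptions for illustration, not a proposal for a specific UI simulator.

import heapq

def trace_driven_simulation(trace, service_time=0.5):
    """Trace-driven discrete-event sketch: logged user requests (arrival times
    taken from real usage) are served by one simulated resource, yielding
    per-request waiting times. The timings and single-server model are assumed."""
    events = [(t, name) for t, name in trace]   # (arrival time, action)
    heapq.heapify(events)
    clock, results = 0.0, []
    while events:
        arrival, name = heapq.heappop(events)
        start = max(clock, arrival)             # wait if the resource is busy
        clock = start + service_time
        results.append((name, round(start - arrival, 3)))  # waiting time
    return results

# Hypothetical usage trace: (timestamp in seconds, logged action)
trace = [(0.0, "open page"), (0.2, "submit query"), (0.3, "open result")]
print(trace_driven_simulation(trace))
# -> [('open page', 0.0), ('submit query', 0.3), ('open result', 0.7)]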

Monte Carlo simulations enable an evaluator to model a system probabilistically (i.e., sampling from a probability distribution is used to determine what event occurs next). Monte Carlo simulation could contribute substantially to automated analysis by relaxing the need for explicit task hierarchies or user models, although such models could still be employed. Most simulations in this domain rely on a single user model, typically an expert user. Monte Carlo simulation would enable designers to perform what-if analysis and study design alternatives with many user models. The approach employed by Chi et al. [2000] to simulate Web site navigation is a close approximation of Monte Carlo simulation.
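A Monte Carlo treatment of user behavior can be sketched as follows: each user model is a probability distribution over next actions, and repeated sampled sessions yield a distribution of task times for each population. The user models, probabilities, and action times below are hypothetical.

import random

# Monte Carlo sketch: the next action is sampled from a probability
# distribution per user model, so many plausible users can be simulated
# without explicit task hierarchies. All numbers are hypothetical.

USER_MODELS = {
    "novice": {"browse": 0.6, "search": 0.2, "give_up": 0.2},
    "expert": {"browse": 0.2, "search": 0.7, "give_up": 0.1},
}
ACTION_TIME = {"browse": 8.0, "search": 4.0, "give_up": 0.0}

def simulate_session(model, rng, max_actions=20):
    total = 0.0
    actions = list(model)
    weights = [model[a] for a in actions]
    for _ in range(max_actions):
        action = rng.choices(actions, weights)[0]
        total += ACTION_TIME[action]
        if action == "give_up":
            break
    return total

rng = random.Random(0)                      # fixed seed for repeatability
for name, model in USER_MODELS.items():
    runs = [simulate_session(model, rng) for _ in range(1000)]
    print(name, round(sum(runs) / len(runs), 1), "s on average")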

10. CONCLUSIONS

In this article we provided an overview of usability evaluation automation and presented a taxonomy for comparing various methods. We also presented an extensive survey of the use of automation in WIMP and Web interface evaluation, finding that automation is used in only 33% of methods surveyed. Of all the surveyed methods, only 29% are free from requirements of formal or informal interface use. All approaches that do not require formal or informal interface use, with the exception of guideline review, are based on analytical modeling or simulation.

It is important to keep in mind that automation of usability evaluation does not capture important qualitative and subjective information (such as user preferences and misconceptions) that can only be unveiled via usability testing, heuristic evaluation, and other standard inquiry methods. Nevertheless, simulation and analytical modeling should be useful for helping designers choose among design alternatives before committing to expensive development costs.

Furthermore, evaluators could use automation in tandem with what are usually nonautomated methods, such as heuristic evaluation and usability testing. For example, an evaluator doing a heuristic evaluation could observe automatically generated usage traces executing within a UI.

Adding automation to usability evaluation has many potential benefits, including reducing the costs of nonautomated methods, aiding in comparisons between alternative designs, and improving consistency in usability evaluation. Research to further develop analytical modeling, simulation, guideline review, and log file analysis techniques could result in new, effective automated techniques.

APPENDIX

A. AUTOMATION CHARACTERISTICS OF WIMP AND WEB INTERFACES

The following tables depict automation characteristics for WIMP and Web interfaces separately. We combined this information in Table I.


Table XVIII. Automation Support for 75 WIMP UE Methods^a

Testing
  Thinking-Aloud Protocol | None: F (1)
  Question-Asking Protocol | None: F (1)
  Shadowing Method | None: F (1)
  Coaching Method | None: F (1)
  Teaching Method | None: F (1)
  Codiscovery Learning | None: F (1)
  Performance Measurement | None: F (1); Capture: F (4)
  Log File Analysis | Analysis: IFM (10)*
  Retrospective Testing | None: F (1)
  Remote Testing | Capture: IF (2)

Inspection
  Guideline Review | None: IF (2); Analysis: (3); Critique: M (5)†
  Cognitive Walkthrough | None: IF (2); Capture: F (1)
  Pluralistic Walkthrough | None: IF (1)
  Heuristic Evaluation | None: IF (1)
  Perspective-Based Inspection | None: IF (1)
  Feature Inspection | None: IF (1)
  Formal Usability Inspection | None: F (1)
  Consistency Inspection | None: IF (1)
  Standards Inspection | None: IF (1)

Inquiry
  Contextual Inquiry | None: IF (1)
  Field Observation | None: IF (1)
  Focus Groups | None: IF (1)
  Interviews | None: IF (1)
  Surveys | None: IF (1)
  Questionnaires | None: IF (1); Capture: IF (2)
  Self-Reporting Logs | None: IF (1)
  Screen Snapshots | None: IF (1)
  User Feedback | None: IF (1)

Analytical Modeling
  GOMS Analysis | None: M (4); Analysis: M (2)
  UIDE Analysis | Analysis: M (2)
  Cognitive Task Analysis | Analysis: M (1)
  Task-Environment Analysis | None: M (1)
  Knowledge Analysis | None: M (2)
  Design Analysis | None: M (2)
  Programmable User Models | Analysis: M (1)

Simulation
  Information Proc. Modeling | Analysis: M (8)
  Petri Net Modeling | Analysis: FM (1)
  Genetic Algorithm Modeling | Capture: (1)

Totals by automation type: None 30 (68%); Capture 5 (11%); Analysis 8 (18%); Critique 1 (2%)

^a A number in parentheses indicates the number of UE methods surveyed for a particular method type and automation type. The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M). Four software tools provide automation support for multiple method types: DRUM, performance measurement and log file analysis; AMME, log file analysis and Petri net modeling; KALDI, performance measurement, log file analysis, and remote testing; and UsAGE, performance measurement and log file analysis.
* Either formal or informal interface use is required. In addition, a model may be used in the analysis.
† Methods may or may not employ a model.


Table XIX. Descriptions of the WIMP UE Method Types Depicted in Table XVIII

Method Class / Method Type | Description

Testing
  Thinking-Aloud Protocol | user talks during test
  Question-Asking Protocol | tester asks user questions
  Shadowing Method | expert explains user actions to tester
  Coaching Method | user can ask an expert questions
  Teaching Method | expert user teaches novice user
  Codiscovery Learning | two users collaborate
  Performance Measurement | tester records usage data during test
  Log File Analysis | tester analyzes usage data
  Retrospective Testing | tester reviews videotape with user
  Remote Testing | tester and user are not colocated during test

Inspection
  Guideline Review | expert checks guideline conformance
  Cognitive Walkthrough | expert simulates user's problem solving
  Pluralistic Walkthrough | multiple people conduct cognitive walkthrough
  Heuristic Evaluation | expert identifies violations of heuristics
  Perspective-Based Inspection | expert conducts narrowly focused heuristic evaluation
  Feature Inspection | expert evaluates product features
  Formal Usability Inspection | expert conducts formal heuristic evaluation
  Consistency Inspection | expert checks consistency across products
  Standards Inspection | expert checks for standards compliance

Inquiry
  Contextual Inquiry | interviewer questions users in their environment
  Field Observation | interviewer observes system use in user's environment
  Focus Groups | multiple users participate in a discussion session
  Interviews | one user participates in a discussion session
  Surveys | interviewer asks user specific questions
  Questionnaires | user provides answers to specific questions
  Self-Reporting Logs | user records UI operations
  Screen Snapshots | user captures UI screens
  User Feedback | user submits comments

Analytical Modeling
  GOMS Analysis | predict execution and learning time
  UIDE Analysis | conduct GOMS analysis within a UIDE
  Cognitive Task Analysis | predict usability problems
  Task-Environment Analysis | assess mapping of user's goals into UI tasks
  Knowledge Analysis | predict learnability
  Design Analysis | assess design complexity
  Programmable User Models | write program that acts like a user

Simulation
  Information Proc. Modeling | mimic user interaction
  Petri Net Modeling | mimic user interaction from usage data
  Genetic Algorithm Modeling | mimic novice user interaction


Table XX. Automation Support for 57 Web UE Methods^a

Testing
  Thinking-Aloud Protocol | None: F (1)
  Question-Asking Protocol | None: F (1)
  Shadowing Method | None: F (1)
  Coaching Method | None: F (1)
  Teaching Method | None: F (1)
  Codiscovery Learning | None: F (1)
  Performance Measurement | None: F (1); Capture: F (3)
  Log File Analysis | Analysis: IFM (9)
  Retrospective Testing | None: F (1)
  Remote Testing | Capture: IF (3)

Inspection
  Guideline Review | None: IF (4); Analysis: (5); Critique: (6)
  Cognitive Walkthrough | None: IF (2)
  Pluralistic Walkthrough | None: IF (1)
  Heuristic Evaluation | None: IF (1)
  Perspective-Based Inspection | None: IF (1)
  Feature Inspection | None: IF (1)
  Formal Usability Inspection | None: F (1)
  Consistency Inspection | None: IF (1)
  Standards Inspection | None: IF (1)

Inquiry
  Contextual Inquiry | None: IF (1)
  Field Observation | None: IF (1)
  Focus Groups | None: IF (1)
  Interviews | None: IF (1)
  Surveys | None: IF (1)
  Questionnaires | None: IF (1); Capture: IF (1)
  Self-Reporting Logs | None: IF (1)
  Screen Snapshots | None: IF (1)
  User Feedback | None: IF (1)

Analytical Modeling
  No Methods Surveyed

Simulation
  Information Proc. Modeling | Analysis: M (1)
  Information Scent Modeling | Capture: M (1)

Totals by automation type: None 26 (76%); Capture 4 (12%); Analysis 3 (9%); Critique 1 (3%)

^a A number in parentheses indicates the number of UE methods surveyed for a particular method type and automation type. The effort level for each method is represented as: minimal (blank), formal (F), informal (I), and model (M). Two software tools provide automation support for multiple method types: Dome Tree visualization, log file analysis and information scent modeling; and WebVIP, performance measurement and remote testing.


Table XXI. Descriptions of the Web UE Method Types Depicted in Table XX

Method Class / Method Type | Description

Testing
  Thinking-Aloud Protocol | user talks during test
  Question-Asking Protocol | tester asks user questions
  Shadowing Method | expert explains user actions to tester
  Coaching Method | user can ask an expert questions
  Teaching Method | expert user teaches novice user
  Codiscovery Learning | two users collaborate
  Performance Measurement | tester records usage data during test
  Log File Analysis | tester analyzes usage data
  Retrospective Testing | tester reviews videotape with user
  Remote Testing | tester and user are not colocated during test

Inspection
  Guideline Review | expert checks guideline conformance
  Cognitive Walkthrough | expert simulates user's problem solving
  Pluralistic Walkthrough | multiple people conduct cognitive walkthrough
  Heuristic Evaluation | expert identifies violations of heuristics
  Perspective-Based Inspection | expert conducts narrowly focused heuristic evaluation
  Feature Inspection | expert evaluates product features
  Formal Usability Inspection | expert conducts formal heuristic evaluation
  Consistency Inspection | expert checks consistency across products
  Standards Inspection | expert checks for standards compliance

Inquiry
  Contextual Inquiry | interviewer questions users in their environment
  Field Observation | interviewer observes system use in user's environment
  Focus Groups | multiple users participate in a discussion session
  Interviews | one user participates in a discussion session
  Surveys | interviewer asks user specific questions
  Questionnaires | user provides answers to specific questions
  Self-Reporting Logs | user records UI operations
  Screen Snapshots | user captures UI screens
  User Feedback | user submits comments

Analytical Modeling
  No Methods Surveyed

Simulation
  Information Proc. Modeling | mimic user interaction
  Information Scent Modeling | mimic Web site navigation

ACKNOWLEDGMENTS

We thank the anonymous reviewers for invaluable guidance on improving our presentation of this material. We thank James Hom and Zhijun Zhang for allowing us to use their extensive archives of usability methods for this survey. We also thank Zhijun Zhang for participating in several interviews on usability evaluation. We thank Bonnie John and Scott Hudson for helping us locate information on GOMS and other simulation methods for this survey, and James Landay and Mark Newman for helpful feedback and data.

REFERENCES

ABELOW, D. 1993. Automating feedback on software product use. CASE Trends December, 15–17.

ADDY & ASSOCIATES. 2000. Dr. Watson version 4.0. Available at http://watson.addy.com/.

AHLBERG, C. AND SHNEIDERMAN, B. 1994. Visual information seeking: Tight coupling of dynamic query filters with starfield displays. In Proceedings of the Conference on Human Factors in Computing Systems (Boston, MA, April), pp. 313–317. New York, NY: ACM Press.

AL-QAIMARI, G. AND MCROSTIE, D. 1999. KALDI: A computer-aided usability engineering tool for supporting testing and analysis of human-computer interaction. In J. Vanerdonckt and A. Puerta, Eds., Proceedings of the Third International Conference on Computer-Aided Design of User Interfaces (Louvain-la-Neuve, Belgium, October). Dordrecht, The Netherlands: Kluwer Academic Publishers.

ANDERSON, J. 1993. Rules of the Mind. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

ANDERSON, J. R. 1990. The Adaptive Character of Thought. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

BACHELDOR, B. 1999. Push for performance. Information Week September 20, 18–20.


BALBO, S. 1995. Automatic evaluation of user interface usability: Dream or reality. In S. Balbo, Ed., Proceedings of the Queensland Computer-Human Interaction Symposium (Queensland, Australia, August). Bond University.

BALBO, S. 1996. EMA: Automatic analysis mechanism for the ergonomic evaluation of user interfaces. Tech. Rep. 96/44 (August), CSIRO Division of Information Technology. Available at http://www.cmis.csiro.au/sandrine.balbo/Ema/ema tr/ema-tr.doc.

BARNARD, P. J. 1987. Cognitive resources and the learning of human-computer dialogs. In J. M. Carroll, Ed., Interfacing Thought: Cognitive Aspects of Human-Computer Interaction, pp. 112–158. Cambridge, MA: MIT Press.

BARNARD, P. J. AND TEASDALE, J. D. 1991. Interacting cognitive subsystems: A systemic approach to cognitive-affective interaction and change. Cognition and Emotion 5, 1–39.

BAUMEISTER, L. K., JOHN, B. E., AND BYRNE, M. D. 2000. A comparison of tools for building GOMS models. In Proceedings of the Conference on Human Factors in Computing Systems, Volume 1 (The Hague, The Netherlands, April), pp. 502–509. New York, NY: ACM Press.

BEARD, D. V., SMITH, D. K., AND DENELSBECK, K. M. 1996. Quick and dirty GOMS: A case study of computed tomography interpretation. Human-Computer Interaction 11, 2, 157–180.

BELLOTTI, V. 1988. Implications of current design practice for the use of HCI techniques. In Proceedings of the HCI Conference on People and Computers IV (Manchester, United Kingdom, September), pp. 13–34. Amsterdam, The Netherlands: Elsevier Science Publishers.

BODART, F., HENNEBERT, A. M., LEHEUREUX, J. M., AND VANDERDONCKT, J. 1994. Towards a dynamic strategy for computer-aided visual placement. In T. Catarci, M. Costabile, M. Levialdi, and G. Santucci, Eds., Proceedings of the International Conference on Advanced Visual Interfaces (Bari, Italy), pp. 78–87. New York, NY: ACM Press.

BORGES, J. A., MORALES, I., AND RODRIGUEZ, N. J. 1996. Guidelines for designing usable world wide web pages. In Proceedings of the Conference on Human Factors in Computing Systems, Volume 2 (Vancouver, Canada, April), pp. 277–278. New York, NY: ACM Press.

BOWERS, N. 1996. Weblint: Quality assurance for the World Wide Web. In Proceedings of the Fifth International World Wide Web Conference (Paris, France, May). Amsterdam, The Netherlands: Elsevier Science Publishers. Available at http://www5conf.inria.fr/fich html/papers/P34/Overview.html.

BRAJNIK, G. 2000. Automatic web usability evaluation: Where is the limit? In Proceedings of the Sixth Conference on Human Factors & the Web (Austin, TX, June). Available at http://www.tri.sbc.com/hfweb/brajnik/hfweb-brajnik.html.

BYRNE, M. D., JOHN, B. E., WEHRLE, N. S., AND CROW, D. C. 1999. The tangled web we wove: A taxonomy of WWW use. In Proceedings of the Conference on Human Factors in Computing Systems, Volume 1 (Pittsburgh, PA, May), pp. 544–551. New York, NY: ACM Press.

BYRNE, M. D., WOOD, S. D., SUKAVIRIYA, P. N., FOLEY, J. D., AND KIERAS, D. 1994. Automating interface evaluation. In Proceedings of the Conference on Human Factors in Computing Systems, Volume 1 (April), pp. 232–237. New York, NY: ACM Press.

CARD, S. K., MORAN, T. P., AND NEWELL, A. 1983. The Psychology of Human-Computer Interaction. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

CENTERLINE. 1999. QC/Replay. Available at http://www.centerline.com/productline/qcreplay/qcreplay.html.

CHAK, A. 2000. Usability tools: A useful start. Web Techniques (2000), August, 18–20. Available at http://www.webtechniques.com/archives/2000/08/stratrevu/.

CHI, E. H., PIROLLI, P., AND PITKOW, J. 2000. The scent of a site: A system for analyzing and predicting information scent, usage, and usability of a web site. In Proceedings of the Conference on Human Factors in Computing Systems (The Hague, The Netherlands, April), pp. 161–168. New York, NY: ACM Press.

CLARK, D. AND DARDAILLER, D. 1999. Accessibility on the web: Evaluation & repair tools to make it possible. In Proceedings of the CSUN Technology and Persons with Disabilities Conference (Los Angeles, CA, March). Available at http://www.dinf.org/csun 99/session0030.html.

COMBER, T. 1995. Building usable web pages: An HCI perspective. In R. Debreceny and A. Ellis, Eds., Proceedings of the First Australian World Wide Web Conference (Ballina, Australia, April), pp. 119–124. Ballina, Australia: Norsearch. Available at http://www.scu.edu.au/sponsored/ausweb/ausweb95/papers/hypertext/comber/.

COOPER, M. 1999. Universal design of a web site. In Proceedings of the CSUN Technology and Persons with Disabilities Conference (Los Angeles, CA, March). Available at http://www.dinf.org/csun 99/session0030.html.

COUTAZ, J. 1995. Evaluation techniques: Exploring the intersection of HCI and software engineering. In R. N. Taylor and J. Coutaz, Eds., Software Engineering and Human-Computer Interaction, Lecture Notes in Computer Science, pp. 35–48. Heidelberg, Germany: Springer-Verlag.

CUGINI, J. AND SCHOLTZ, J. 1999. VISVIP: 3D visualization of paths through web sites. In Proceedings of the International Workshop on Web-Based Information Visualization (Florence, Italy, September), pp. 259–263. Institute of Electrical and Electronics Engineers.

DE HAAN, G., VAN DER VEER, G. C., AND VAN VLIET, J. C. 1992. Formal modelling techniques in human-computer interaction. In G. C. van der Veer, S. Bagnara, and G. A. M. Kempen, Eds., Cognitive Ergonomics: Contributions from Experimental Psychology, Theoretical Issues, pp. 27–67. Amsterdam, The Netherlands: Elsevier Science Publishers.

DE SOUZA, F. AND BEVAN, N. 1990. The use of guidelines in menu interface design: Evaluation of a draft standard. In G. Cockton, D. Diaper, and B. Shackel, Eds., Proceedings of the Third IFIP TC13 Conference on Human-Computer Interaction (Cambridge, UK, August), pp. 435–440. Amsterdam, The Netherlands: Elsevier Science Publishers.

DETWEILER, M. C. AND OMANSON, R. C. 1996. Ameritech web page user interface standards and design guidelines. Ameritech Corporation, Chicago, IL. Available at http://www.ameritech.com/corporate/testtown/library/standard/web guidelines/index.html.

DIX, A., FINLAY, J., ABOWD, G., AND BEALE, R. 1998. Human-Computer Interaction (second ed.). Prentice Hall, Upper Saddle River, NJ.

DROTT, M. C. 1998. Using web server logs to improve site design. In Proceedings of the 16th International Conference on Systems Documentation (Quebec, Canada, September), pp. 43–50. New York, NY: ACM Press.

ETGEN, M. AND CANTOR, J. 1999. What does getting WET (web event-logging tool) mean for web usability. In Proceedings of the Fifth Conference on Human Factors & the Web (Gaithersburg, MD, June). Available at http://www.nist.gov/itl/div894/vvrg/hfweb/proceedings/etgen-cantor/index.html.

FARADAY, P. 2000. Visually critiquing web pages. In Proceedings of the Sixth Conference on Human Factors & the Web (Austin, TX, June). Available at http://www.tri.sbc.com/hfweb/faraday/faraday.htm.

FARADAY, P. AND SUTCLIFFE, A. 1998. Providing advice for multimedia designers. In Proceedings of the Conference on Human Factors in Computing Systems, Volume 1 (Los Angeles, CA, April), pp. 124–131. New York, NY: ACM Press.

FARENC, C. AND PALANQUE, P. 1999. A generic framework based on ergonomics rules for computer aided design of user interface. In J. Vanderdonckt and A. Puerta, Eds., Proceedings of the Third International Conference on Computer-Aided Design of User Interfaces (Louvain-la-Neuve, Belgium, October). Dordrecht, The Netherlands: Kluwer Academic Publishers.

FARENC, C., LIBERATI, V., AND BARTHET, M.-F. 1999. Automatic ergonomic evaluation: What are the limits. In Proceedings of the Third International Conference on Computer-Aided Design of User Interfaces (Louvain-la-Neuve, Belgium, October). Dordrecht, The Netherlands: Kluwer Academic Publishers.

FOGG, B. J. 1999. What variables affect web credibility? Available at http://www.webcredibility.org/variables files/v3 document.htm.

FOGG, B. J., MARSHALL, J., OSIPOVICH, A., VARMA, C., LARAKI, O., FANG, N., PAVI, J., RANGNEKAR, A., SHON, J., SWANI, P., AND TREINEN, M. 2000. Elements that affect web credibility: Early results from a self-report study. In Proceedings of the Conference on Human Factors in Computing Systems, number Extended Abstracts, 287–288. The Hague, The Netherlands. New York, NY: ACM Press. Available at http://www.webcredibility.org/webCredEarlyResults.ppt.

FULLER, R. AND DE GRAAFF, J. J. 1996. Measuring user motivation from server log files. In Proceedings of the Second Conference on Human Factors & the Web (Redmond, WA, October). Available at http://www.microsoft.com/usability/webconf/fuller/fuller.htm.

GLENN, F. A., SCHWARTZ, S. M., AND ROSS, L. V. 1992. Development of a human operator simulator version v (HOS-V): Design and implementation. U.S. Army Research Institute for the Behavioral and Social Sciences, PERI-POX, Alexandria, VA.

GUZDIAL, M., SANTOS, P., BADRE, A., HUDSON, S., AND GRAY, M. 1994. Analyzing and visualizing log files: A computational science of usability. GVU Center TR GIT-GVU-94-8, Georgia Institute of Technology. Available at http://www.cc.gatech.edu/gvu/reports/1994/abstracts/94-08.html.

HAMMONTREE, M. L., HENDRICKSON, J. J., AND HENSLEY, B. W. 1992. Integrated data capture and analysis tools for research and testing on graphical user interfaces. In Proceedings of the Conference on Human Factors in Computing Systems (Monterey, CA, May), pp. 431–432. New York, NY: ACM Press.

HARPER, B. AND NORMAN, K. 1993. Improving user satisfaction: The questionnaire for user satisfaction interaction version 5.5. In Proceedings of the First Annual Mid-Atlantic Human Factors Conference (Virginia Beach, VA), pp. 224–228.

HARTSON, H. R., CASTILLO, J. C., KELSA, J., AND NEALE, W. C. 1996. Remote evaluation: The network as an extension of the usability laboratory. In M. J. Tauber, V. Bellotti, R. Jeffries, J. D. Mackinlay, and J. Nielsen, Eds., Proceedings of the Conference on Human Factors in Computing Systems (Vancouver, Canada, April), pp. 228–235. New York, NY: ACM Press.

HELFRICH, B. AND LANDAY, J. A. 1999. QUIP: Quantitative user interface profiling. Unpublished manuscript. Available at http://home.earthlink.net/~bhelfrich/quip/index.html.

HOCHHEISER, H. AND SHNEIDERMAN, B. 2001. Using interactive visualizations of WWW log data to characterize access patterns and inform site design. Journal of American Society for Information Science and Technology 52, 4 (February), 331–343.

HOM, J. 1998. The usability methods toolbox. Available at http://www.best.com/~jthom/usability/usable.htm.

HUDSON, S. E., JOHN, B. E., KNUDSEN, K., AND BYRNE, M. D. 1999. A tool for creating predictive performance models from user interface demonstrations. In Proceedings of the 12th Annual ACM Symposium on User Interface Software and Technology (Asheville, NC, November), pp. 92–103. New York, NY: ACM Press.

HUMAN FACTORS ENGINEERING. 1999. Usability evaluation methods. Available at http://www.cs.umd.edu/~zzj/UsabilityHome.html.

INTERNATIONAL STANDARDS ORGANIZATION. 1999. Ergonomic requirements for office work with visual display terminals, part 11: Guidance on usability. Available at http://www.iso.ch/iso/en/catalogueDetailPage.catalogueDetail?CSNumber=16883&ICS1=13&ICS2=180&ICS3=.

IVORY, M. Y. 2001. An empirical foundation for automated web interface evaluation. PhD Thesis, University of California, Berkeley, Computer Science Division.

IVORY, M. Y., SINHA, R. R., AND HEARST, M. A. 2000. Preliminary findings on quantitative measures for distinguishing highly rated information-centric web pages. In Proceedings of the Sixth Conference on Human Factors & the Web (Austin, TX, June). Available at http://www.tri.sbc.com/hfweb/ivory/paper.html.

IVORY, M. Y., SINHA, R. R., AND HEARST, M. A. 2001. Empirically validated web page design metrics. In Proceedings of the Conference on Human Factors in Computing Systems (Seattle, WA, March), pp. 53–60. New York, NY: ACM Press.

JAIN, R. 1991. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley-Interscience, New York, NY.

JEFFRIES, R., MILLER, J. R., WHARTON, C., AND UYEDA, K. M. 1991. User interface evaluation in the real world: A comparison of four techniques. In Proceedings of the Conference on Human Factors in Computing Systems (New Orleans, LA, April), pp. 119–124. New York, NY: ACM Press.

JIANG, J., MURPHY, E., AND CARTER, L. 1993. Computer-human interaction models (CHIMES): Revision 3. Tech. Rep. DSTL-94-008 (May), National Aeronautics and Space Administration.

JOHN, B. E. AND KIERAS, D. E. 1996. The GOMS family of user interface analysis techniques: Comparison and contrast. ACM Transactions on Computer-Human Interaction 3, 4, 320–351.

KASIK, D. J. AND GEORGE, H. G. 1996. Toward automatic generation of novice user test scripts. In Proceedings of the Conference on Human Factors in Computing Systems, Volume 1 (Vancouver, Canada, April), pp. 244–251. New York, NY: ACM Press.

KIERAS, D. AND POLSON, P. G. 1985. An approach to the formal analysis of user complexity. International Journal of Man-Machine Studies 22, 4, 365–394.

KIERAS, D. E., WOOD, S. D., ABOTEL, K., AND HORNOF, A. 1995. GLEAN: A computer-based tool for rapid GOMS model usability evaluation of user interface designs. In Proceedings of the Eighth ACM Symposium on User Interface Software and Technology (Pittsburgh, PA, November), pp. 91–100. New York, NY: ACM Press.

KIERAS, D. E., WOOD, S. D., AND MEYER, D. E. 1997. Predictive engineering models based on the EPIC architecture for a multimodal high-performance human-computer interaction task. ACM Transactions on Computer-Human Interaction 4, 3 (September), 230–275.

KIM, W. C. AND FOLEY, J. D. 1993. Providing high-level control and expert assistance in the user interface presentation design. In S. Ashlund, K. Mullet, A. Henderson, E. Hollnagel, and T. White, Eds., Proceedings of the Conference on Human Factors in Computing Systems (Amsterdam, The Netherlands, April), pp. 430–437. New York, NY: ACM Press.

KIRAKOWSKI, J. AND CLARIDGE, N. 1998. Human centered measures of success in web site design. In Proceedings of the Fourth Conference on Human Factors & the Web (Basking Ridge, NJ, June). Available at http://www.research.att.com/conf/hfweb/proceedings/kirakowski/index.html.

LAIRD, J. E. AND ROSENBLOOM, P. 1996. The evolution of the Soar cognitive architecture. In D. M. Steier and T. M. Mitchell, Eds., Mind Matters: A Tribute to Allen Newell, pp. 1–50. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

LARSON, K. AND CZERWINSKI, M. 1998. Web page design: Implications of memory, structure and scent for information retrieval. In Proceedings of the Conference on Human Factors in Computing Systems, Volume 1 (Los Angeles, CA, April), pp. 25–32. New York, NY: ACM Press.

LECEROF, A. AND PATERNO, F. 1998. Automatic support for usability evaluation. IEEE Transactions on Software Engineering 24, 10 (October), 863–888.

LEE, K. 1997. Motif FAQ. Available at http://www-bioeng.ucsd.edu/~fvetter/misc/Motif-FAQ.txt.

LEVINE, R. 1996. Guide to web style. Sun Microsystems. Available at http://www.sun.com/styleguide/.

LEWIS, C., POLSON, P. G., WHARTON, C., AND RIEMAN, J. 1990. Testing a walkthrough methodology for theory-based design of walk-up-and-use interfaces. In Proceedings of the Conference on Human Factors in Computing Systems (Seattle, WA, April), pp. 235–242. New York, NY: ACM Press.


LOWGREN, J. AND NORDQVIST, T. 1992. Knowledge-based evaluation as design support for graphical user interfaces. In Proceedings of the Conference on Human Factors in Computing Systems (Monterey, CA, May), pp. 181–188. New York, NY: ACM Press.

LYNCH, G., PALMITER, S., AND TILT, C. 1999. The max model: A standard web site user model. In Proceedings of the Fifth Conference on Human Factors & the Web (Gaithersburg, MD, June). Available at http://www.nist.gov/itl/div894/vvrg/hfweb/proceedings/lynch/index.html.

LYNCH, P. J. AND HORTON, S. 1999. Web Style Guide: Basic Design Principles for Creating Web Sites. Yale University Press. Available at http://info.med.yale.edu/caim/manual/.

MACLEOD, M. AND RENGGER, R. 1993. The development of DRUM: A software tool for video-assisted usability evaluation. In Proceedings of the HCI Conference on People and Computers VIII (Loughborough, UK, September), pp. 293–309. Cambridge University Press, Cambridge, UK.

MAHAJAN, R. AND SHNEIDERMAN, B. 1997. Visual & textual consistency checking tools for graphical user interfaces. IEEE Transactions on Software Engineering 23, 11 (November), 722–735.

MAY, J. AND BARNARD, P. J. 1994. Supportive evaluation of interface design. In C. Stary, Ed., Proceedings of the First Interdisciplinary Workshop on Cognitive Modeling and User Interface Design (Vienna, Austria, December).

MERCURY INTERACTIVE. 2000. WinRunner. Available at http://www-svca.mercuryinteractive.com/products/winrunner/.

MOLICH, R., BEVAN, N., BUTLER, S., CURSON, I., KINDLUND, E., KIRAKOWSKI, J., AND MILLER, D. 1998. Comparative evaluation of usability tests. In Proceedings of the UPA Conference (Washington, DC, June), pp. 189–200. Usability Professionals' Association, Chicago, IL.

MOLICH, R., THOMSEN, A. D., KARYUKINA, B., SCHMIDT, L., EDE, M., VAN OEL, W., AND ARCURI, M. 1999. Comparative evaluation of usability tests. In Proceedings of the Conference on Human Factors in Computing Systems (Pittsburgh, PA, May), pp. 83–86. New York, NY: ACM Press.

MORAN, T. P. 1981. The command language grammar: A representation for the user interface of interactive computer systems. International Journal of Man-Machine Studies 15, 1, 3–50.

MORAN, T. P. 1983. Getting into a system: External-internal task mapping analysis. In Proceedings of the Conference on Human Factors in Computing Systems (Boston, MA, December), pp. 45–49. New York, NY: ACM Press.

NETRAKER. 2000. The NetRaker suite. Available at http://www.netraker.com/info/applications/index.asp.

NIELSEN, J. 1993. Usability Engineering. Boston, MA: Academic Press.

OLSEN, JR., D. R. 1992. User Interface Management Systems: Models and Algorithms. San Mateo, CA: Morgan Kaufman Publishers, Inc.

OLSEN, JR., D. R. AND HALVERSEN, B. W. 1988. Interface usage measurements in a user interface management system. In Proceedings of the ACM SIGGRAPH Symposium on User Interface Software (Alberta, Canada, October), pp. 102–108. New York, NY: ACM Press.

OPEN SOFTWARE FOUNDATION. 1991. OSF/Motif Style Guide. Number Revision 1.1 (for OSF/Motif release 1.1). Englewood Cliffs, NJ: Prentice Hall.

PALANQUE, P., FARENC, C., AND BASTIDE, R. 1999. Embedding ergonomic rules as generic requirements in the development process of interactive software. In A. Sasse and C. Johnson, Eds., Proceedings of IFIP TC13 Seventh International Conference on Human-Computer Interaction (Edinburgh, Scotland, August). Amsterdam, The Netherlands: IOS Press.

PARUSH, A., NADIR, R., AND SHTUB, A. 1998. Evaluating the layout of graphical user interface screens: Validation of a numerical, computerized model. International Journal of Human Computer Interaction 10, 4, 343–360.

PATERNO, F. AND BALLARDIN, G. 1999. Model-aided remote usability evaluation. In A. Sasse and C. Johnson, Eds., Proceedings of the IFIP TC13 Seventh International Conference on Human-Computer Interaction (Edinburgh, Scotland, August), pp. 434–442. Amsterdam, The Netherlands: IOS Press.

PATERNO, F., MANCINI, C., AND MENICONI, S. 1997. ConcurTaskTrees: Diagrammatic notation for specifying task models. In Proceedings of the IFIP TC13 Sixth International Conference on Human-Computer Interaction (Sydney, Australia, July), pp. 362–369. Sydney, Australia: Chapman and Hall.

PAYNE, S. J. AND GREEN, T. R. G. 1986. Task-action grammars: A model of the mental representation of task languages. Human-Computer Interaction 2, 93–133.

PECK, V. A. AND JOHN, B. E. 1992. Browser-soar: A computational model of a highly interactive task. In Proceedings of the Conference on Human Factors in Computing Systems (Monterey, CA, May), pp. 165–172. New York, NY: ACM Press.

PETRI, C. A. 1973. Concepts of net theory. In Mathematical Foundations of Computer Science: Proceedings of the Symposium and Summer School (High Tatras, Czechoslovakia, September), pp. 137–146. Mathematical Institute of the Slovak Academy of Sciences.

PEW, R. W. AND MAVOR, A. S., EDS. 1998. Modeling Human and Organizational Behavior: Application to Military Simulations. Washington, DC: National Academy Press. Available at http://books.nap.edu/html/model.


POLK, T. A. AND ROSENBLOOM, P. S. 1994. Task-independent constraints on a unified theoryof cognition. In F. Boller and J. Grafman,Eds., Handbook of Neuropsychology, Volume 9.Amsterdam, The Netherlands: Elsevier SciencePublishers.

RATNER, J., GROSE, E. M., AND FORSYTHE, C. 1996.Characterization and assessment of HTML styleguides. In Proceedings of the Conference on Hu-man Factors in Computing Systems, Volume 2(Vancouver, Canada, April), pp. 115–116. NewYork, NY: ACM Press.

RAUTERBERG, M. 1995. From novice to expert deci-sion behaviour: A qualitative modeling approachwith Petri nets. In Y. Anzai, K. Ogawa, andH. Mori, Eds., Symbiosis of Human and Artifact:Human and Social Aspects of Human-ComputerInteraction, Volume 20B of Advances in HumanFactors/Ergonomics (1995), pp. 449–454. Ams-terdam, The Netherlands: Elsevier Science Pub-lishers.

RAUTERBERG, M. 1996a. How to measure and toquantify usability of user interfaces. In A. Ozokand G. Salvendy, Eds., Advances in Applied Er-gonomics (1996), pp. 429–432. West Lafayette,IN: USA Publishing.

RAUTERBERG, M. 1996b. A Petri net based analyz-ing and modeling tool kit for logfiles in human-computer interaction. In Proceedings of Cog-nitive Systems Engineering in Process Control(Kyoto, Japan, November), pp. 268–275. Ky-oto University: Graduate School of EnergyScience.

RAUTERBERG, M. AND AEPPILI, R. 1995. Learning inman-machine systems: the measurement of be-havioural and cognitive complexity. In Proceed-ings of the IEEE Conference on Systems, Manand Cybernetics (Vancouver, BC, October), pp.4685–4690. Institute of Electrical and Electron-ics Engineers.

REISNER, P. 1984. Formal grammar as a tool for analyzing ease of use: Some fundamental concepts. In J. C. Thomas and M. L. Schneider, Eds., Human Factors in Computer Systems, pp. 53–78. Norwood, NJ: Ablex Publishing Corp.

REITERER, H. 1994. A user interface design assistant approach. In K. Brunnstein and E. Raubold, Eds., Proceedings of the IFIP 13th World Computer Congress, Volume 2 (Hamburg, Germany, August), pp. 180–187. Amsterdam, The Netherlands: Elsevier Science Publishers.

RIEMAN, J., DAVIES, S., HAIR, D. C., ESEMPLARE, M., POLSON, P., AND LEWIS, C. 1991. An automated cognitive walkthrough. In Proceedings of the Conference on Human Factors in Computing Systems (New Orleans, LA, April), pp. 427–428. New York, NY: ACM Press.

SCAPIN, D., LEULIER, C., VANDERDONCKT, J., MARIAGE, C., BASTIEN, C., FARENC, C., PALANQUE, P., AND BASTIDE, R. 2000. A framework for organizing web usability guidelines. In Proceedings of the Sixth Conference on Human Factors & the Web (Austin, TX, June). Available at http://www.tri.sbc.com/hfweb/scapin/Scapin.html.

SCHOLTZ, J. AND LASKOWSKI, S. 1998. Developing usability tools and techniques for designing and testing web sites. In Proceedings of the Fourth Conference on Human Factors & the Web (Basking Ridge, NJ, June). Available at http://www.research.att.com/conf/hfweb/proceedings/scholtz/index.html.

SCHWARTZ, M. 2000. Web site makeover. Computerworld, January 31. Available at http://www.computerworld.com/home/print.nsf/all/000126e3e2.

SEARS, A. 1995. AIDE: A step toward metric-based interface development tools. In Proceedings of the Eighth ACM Symposium on User Interface Software and Technology (Pittsburgh, PA, November), pp. 101–110. New York, NY: ACM Press.

SERVICE METRICS. 1999. Service Metrics solutions. Available at http://www.servicemetrics.com/solutions/solutionsmain.asp.

SHNEIDERMAN, B. 1998. Designing the User Interface: Strategies for Effective Human-Computer Interaction (third ed.). Reading, MA: Addison-Wesley.

SIOCHI, A. C. AND HIX, D. 1991. A study of computer-supported user interface evaluation using maximal repeating pattern analysis. In Proceedings of the Conference on Human Factors in Computing Systems (New Orleans, LA, April), pp. 301–305. New York, NY: ACM Press.

SMITH, S. L. 1986. Standards versus guidelines for designing user interface software. Behaviour and Information Technology 5, 1, 47–61.

SMITH, S. L. AND MOSIER, J. N. 1986. Guidelines for designing user interface software. Tech. Rep. ESD-TR-86-278, The MITRE Corporation, Bedford, MA 01730.

STEIN, L. D. 1997. The rating game. Available at http://stein.cshl.org/~lstein/rater/.

STREVELER, D. J. AND WASSERMAN, A. I. 1984. Quantitative measures of the spatial properties of screen designs. In B. Shackel, Ed., Proceedings of the IFIP TC13 First International Conference on Human-Computer Interaction (London, UK, September), pp. 81–89. Amsterdam, The Netherlands: North-Holland.

SULLIVAN, T. 1997. Reading reader reaction: A proposal for inferential analysis of web server log files. In Proceedings of the Third Conference on Human Factors & the Web (Boulder, CO, June). Available at http://www.research.att.com/conf/hfweb/conferences/denver3.zip.

TAUBER, M. J. 1990. ETAG: Extended task action grammar – a language for the description of the user’s task language. In G. Cockton, D. Diaper, and B. Shackel, Eds., Proceedings of the IFIP TC13 Third International Conference on Human-Computer Interaction (Cambridge, UK, August), pp. 163–168. Amsterdam, The Netherlands: Elsevier Science Publishers.

THENG, Y. L. AND MARSDEN, G. 1998. Authoring tools: Towards continuous usability testing of web documents. In Proceedings of the First International Workshop on Hypermedia Development (Pittsburgh, PA, June). Available at http://www.eng.uts.edu.au/~dbl/HypDev/ht98w/YinLeng/HT98_YinLeng.html.

THIMBLEBY, H. 1997. Gentler: A tool for systematic web authoring. International Journal of Human-Computer Studies 47, 1, 139–168.

TULLIS, T. S. 1983. The formatting of alphanumeric displays: A review and analysis. Human Factors 25, 657–682.

UEHLING, D. L. AND WOLF, K. 1995. User action graphing effort (UsAGE). In Proceedings of the Conference on Human Factors in Computing Systems (Denver, CO, May). New York, NY: ACM Press.

USABLE NET. 2000. LIFT Online. Available at http://www.usablenet.com.

VANDERDONCKT, J. AND GILLO, X. 1994. Visual techniques for traditional and multimedia layouts. In T. Catarci, M. Costabile, M. Levialdi, and G. Santucci, Eds., Proceedings of the International Conference on Advanced Visual Interfaces (Bari, Italy, June), pp. 95–104. New York, NY: ACM Press.

WEB ACCESSIBILITY INITIATIVE. 1999. Web content accessibility guidelines 1.0. World Wide Web Consortium, Geneva, Switzerland. Available at http://www.w3.org/TR/WAI-WEBCONTENT/.

WEB CRITERIA. 1999. Max, and the objective measurement of web sites. Available at http://www.webcriteria.com.

WEBTRENDS CORPORATION. 2000. WebTrends Live. Available at http://www.webtrendslive.com/default.htm.

WHITEFIELD, A., WILSON, F., AND DOWELL, J. 1991. A framework for human factors evaluation. Behaviour and Information Technology 10, 1, 65–79.

WILLIAMS, K. E. 1993. Automating the cognitive task modeling process: An extension to GOMS for HCI. In Proceedings of the Conference on Human Factors in Computing Systems, Volume 3 (Amsterdam, The Netherlands, April), p. 182. New York, NY: ACM Press.

WORLD WIDE WEB CONSORTIUM. 2000. HTML validation service. Available at http://validator.w3.org/.

YOUNG, R. M., GREEN, T. R. G., AND SIMON, T. 1989. Programmable user models for predictive evaluation of interface designs. In Proceedings of the Conference on Human Factors in Computing Systems (Austin, TX, April), pp. 15–19. New York, NY: ACM Press.

ZACHARY, W., MENTEC, J.-C. L., AND RYDER, J. 1996. Interface agents in complex systems. In C. N. Ntuen and E. H. Park, Eds., Human Interaction With Complex Systems: Conceptual Principles and Design Practice. Dordrecht, The Netherlands: Kluwer Academic Publishers.

ZAPHIRIS, P. AND MTEI, L. 1997. Depth vs. breadth in the arrangement of Web links. Available at http://www.otal.umd.edu/SHORE/bs04.

ZETTLEMOYER, L. S., ST. AMANT, R. S., AND DULBERG, M. S. 1999. IBOTS: Agent control through the user interface. In Proceedings of the International Conference on Intelligent User Interfaces (Redondo Beach, CA, January), pp. 31–37. New York, NY: ACM Press.

Received February 2000; revised January 2001; accepted April 2001
