SIMS 247 Lecture 11
Evaluating Interactive Interfaces
Marti Hearst
February 24, 1998



Many slides from this lecture are closely derived from some created by Professor James Landay, 1997.

Also from Chapter 4 of Shneiderman 97.

Outline

• Why do evaluation?
• What types of evaluations?
• Choosing participants
• Designing the test
• Collecting data
• Analyzing the data
• Drawing conclusions

Why do Evaluation?

• To tell how good or bad a visualization is
  – People must use it to evaluate it
  – Must compare against the status quo
  – Something that looks useful to the designer might be too complex or superfluous for real users
• For iterative design
  – Interface might be almost right but require adjustments
  – The interactive components might have problems
• To advance our knowledge of how people understand and use technology

Types of Evaluation

• Expert Reviews
• Usability Testing
• Controlled Psychologically-Oriented Experiments
• There are tradeoffs for each
  – see the McGrath paper

• Expert Reviews
  – Heuristic Evaluation
    • An expert critiques the design with respect to certain performance criteria
  – Cognitive Walkthrough
    • Make a low-fidelity prototype
    • Designers "walk through" or simulate how a user would use the design under various circumstances
  – Relies on the relevance of experts' opinions (as opposed to what users experience)

• Usability Testing
  – Goals
    • Rapid assessment with real users
    • Find problems with the design
  – Techniques
    • Carefully chosen set of tasks
    • Only a few participants (as opposed to a scientific experiment)
  – Results
    • Recommended changes (as opposed to acceptance or rejection of a hypothesis)

• Techniques for Usability Testing
  – "Thinking Aloud" while using the system
  – Acceptance Tests (see the sketch after this list)
    • Does the interface meet the performance objectives?
      – Time for users to learn specific functions
      – Speed of task performance
      – Rate of errors
      – User retention of commands over time
  – Surveys
  – Focus group discussions
  – Field Tests
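Acceptance objectives like these are easiest to apply when written down as explicit, measurable thresholds before testing begins. A minimal sketch of that idea in Python; all names and threshold values here are hypothetical, not from the lecture:

```python
from dataclasses import dataclass

@dataclass
class AcceptanceCriteria:
    # Illustrative thresholds, one per objective above.
    max_learning_time_min: float  # time to learn specific functions
    max_task_time_min: float      # speed of task performance
    max_error_rate: float         # errors per task
    min_retention_rate: float     # fraction of commands retained over time

def meets_objectives(c: AcceptanceCriteria, learn: float, task: float,
                     errors: float, retention: float) -> bool:
    """True only if every measured value meets its objective."""
    return (learn <= c.max_learning_time_min and
            task <= c.max_task_time_min and
            errors <= c.max_error_rate and
            retention >= c.min_retention_rate)

criteria = AcceptanceCriteria(30, 5, 0.1, 0.8)
print(meets_objectives(criteria, learn=25, task=4.5, errors=0.05, retention=0.9))
```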

• Controlled Psychologically-Oriented Experiments
  – Usually part of a theoretical framework
  – Propose a testable hypothesis
  – Identify a small number of independent variables to manipulate
  – Choose the dependent variables to measure
  – Judiciously choose participants
  – Control for biasing factors
  – Apply statistical methods to data analysis
  – Place results within the theory, or refine or refute the theory if necessary; point the direction to future work

Choosing Participants

• Should be representative of eventual users, in terms of
  – job-specific vocabulary / knowledge
  – tasks
• If you can't get real users, get an approximation
  – system intended for doctors
    • get medical students
  – system intended for electrical engineers
    • get engineering students
• Use incentives to get participants

Ethical Considerations

• Sometimes tests can be distressing
  – users leave in tears
  – users can be embarrassed by mistakes
• You have a responsibility to alleviate this
  – make participation voluntary and use informed consent
  – avoid pressure to participate
  – let them know they can stop at any time
  – stress that you are testing the system, not them
  – make collected data as anonymous as possible
• Often must get human subjects approval

User Study Proposal

• A report that contains
  – objective
  – description of the system being tested
  – task environment & materials
  – participants
  – methodology
  – tasks
  – test measures
• Get it approved
• Once this is done, it is useful for writing the final report

Selecting Tasks

• Should reflect what real tasks will be like
  – may need to shorten if
    • they take too long
    • they require background that test users won't have
• Be sure tasks measure something directly related to your design
• But don't bias the tasks so that only your design can win
  – tasks should be realistic in order to avoid this
• Don't choose tasks that are too fragmented

Special Considerations for Evaluating Visualizations

• Be careful about what is being compared
• Example of how to do it wrong:
  – One study compared a web path history visualization that had
    • thumbnails
    • fisheye properties
    • hierarchical layout
  – against the Netscape textual history list
• Problem:
  – too many variables changed at once!
  – can't tell which of the novel properties caused the effects

Important Factors

• Novices vs. Experts
  – often no effect is found for experts, or experts are slowed down at the same time that novices are helped
  – experts might know the domain while novices do not
    • need to try to separate learning about the domain from learning about the visualization

• Perceptual abilities
  – spatial abilities tests
  – colorblindness
  – handedness (left-handed vs. right-handed)

The "Thinking Aloud" Method

• This is for usability testing, not formal experiments
• Need to know what users are thinking, not just what they are doing
• Ask users to talk while performing tasks
  – tell us what they are thinking
  – tell us what they are trying to do
  – tell us questions that arise as they work
  – tell us things they read
• Make a recording or take good notes
  – make sure you can tell what they were doing

Thinking Aloud (cont.)

• Prompt the user to keep talking
  – "tell me what you are thinking"
• Only help on things you have pre-decided
  – keep track of anything you do give help on
• Recording (see the logging sketch below)
  – use a digital watch/clock
  – take notes
  – keep a computerized log of what actions were taken
  – if possible
    • record audio and video
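A "computerized log" can be as simple as timestamped action records appended to a file as the session runs. A minimal sketch; the event names, participant ID, and file path are hypothetical:

```python
import csv
import time

LOG_PATH = "session_log.csv"  # hypothetical output file

def log_action(participant_id: str, action: str, detail: str = "") -> None:
    """Append one timestamped user action to the session log."""
    with open(LOG_PATH, "a", newline="") as f:
        csv.writer(f).writerow([time.time(), participant_id, action, detail])

# Example: the test harness would call this on each event of interest.
log_action("P01", "click", "history-item-3")
log_action("P01", "help_given", "explained zoom control (pre-decided help)")
```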

Pilot Study

• Goal:
  – help fix problems with the study
  – make sure you are measuring what you mean to be measuring
• Procedure:
  – do it twice
    • first with colleagues
    • then with real users
  – usually end up making changes both times

Instructions to Participants (Gomoll 90)

• Describe the purpose of the evaluation
  – "I'm testing the product; I'm not testing you"
• Tell them they can quit at any time
• Demonstrate the equipment
• Explain how to think aloud
• Explain that you will not provide help
• Describe the task
  – give written instructions

Designing the Experiment

• Reducing variability
  – recruit test users with similar backgrounds
  – brief users to bring them to a common level
  – perform the test the same way every time
    • don't help some more than others (plan in advance)
  – make instructions clear
  – control for outside factors
    • Evaluating an interface that uses web hyperlinks can cause problems
      – variability in network traffic can affect results
  – participants should be run under conditions as similar as possible
  – try to eliminate outside interruptions

Comparing Two Alternatives

• Between-groups experiment
  – two groups of test users
  – each group uses only 1 of the systems
• Within-groups experiment
  – one group of test users
  – each person uses both systems
  – can't use the same tasks (learning effects)
• See if differences are statistically significant (see the sketch below)
  – assumes normal distribution & same std. dev.
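Under exactly the stated assumptions (normal distributions, same standard deviation), the textbook significance test for a between-groups comparison is an independent-samples t-test, and a paired t-test for within groups. A minimal sketch with made-up task times, not real data:

```python
from scipy import stats

# Hypothetical task-completion times in seconds for two designs.
design_a = [95, 110, 102, 88, 120, 105]
design_b = [80, 92, 85, 99, 78, 90]

# Between groups: independent-samples t-test
# (equal_var=True matches the equal-std.-dev. assumption).
t, p = stats.ttest_ind(design_a, design_b, equal_var=True)
print(f"between groups: t={t:.2f}, p={p:.3f}")

# Within groups (same people used both systems): paired t-test.
t, p = stats.ttest_rel(design_a, design_b)
print(f"within groups:  t={t:.2f}, p={p:.3f}")
```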

Experimental Details

• Order of tasks
  – between groups
    • choose one simple order (simple -> complex)
  – within groups
    • must vary the ordering to make sure there are no effects based on the order in which the tasks occurred (a counterbalancing sketch follows this list)
• Training
  – depends on how the real system will be used
• What if someone doesn't finish?
  – assign a very large time & a large # of errors

Measurements

• Attributes that are useful to measure
  – time requirements for task completion
  – successful task completion
  – compare two designs on speed or # of errors
  – application-specific measures
    • e.g., how many web pages visited
• Time is easy to record (see the timing sketch below)
• Error or successful completion is harder
  – define in advance what these mean
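Timing a task is a thin wrapper around a clock; completion and errors must be coded by the experimenter against the definitions fixed in advance. A minimal sketch; the task callable is a hypothetical stand-in:

```python
import time

def timed_task(run_task):
    """Run one task; return (elapsed_seconds, completed, error_count).
    run_task must return (completed, error_count) coded against the
    error definitions decided before the study began."""
    start = time.perf_counter()
    completed, errors = run_task()
    return time.perf_counter() - start, completed, errors

# Hypothetical usage: the experimenter codes completion and errors live.
elapsed, completed, errors = timed_task(lambda: (True, 2))
print(f"{elapsed:.1f}s, completed={completed}, errors={errors}")
```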

Measuring User Preference

• How much users like or dislike the system
  – can ask them to rate on a scale of 1 to 10 (a summary sketch follows this list)
  – or have them choose among statements
    • "this visualization helped me with the problem…"
    • hard to be sure what the data will mean
    • novelty of UI, feelings, not a realistic setting, etc.
• If many give you low ratings, you are in trouble
• Can get some useful data by asking
  – what they liked, disliked, where they had trouble, best part, worst part, etc. (redundant questions)
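Rating-scale data is ordinal, so the median and the full distribution are usually safer summaries than a mean alone. A minimal sketch with made-up ratings:

```python
from collections import Counter
from statistics import median

ratings = [7, 8, 3, 9, 8, 2, 7, 8]  # hypothetical 1-10 ratings
print("median:", median(ratings))
print("distribution:", sorted(Counter(ratings).items()))
```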

Debriefing Participants

• Interview the participants at the end of the study
• Ask structured questions
• Ask general open-ended questions about the interface
  – Subjects often don't remember details
    • video segments can help with this
  – Ask for comments on specific features
    • show them the screen (online or on paper) and then ask questions

Analyzing the Numbers

• Example: trying to get task time <= 30 min.
  – test gives: 20, 15, 40, 90, 10, 5
  – mean (average) = 30
  – median (middle) = 17.5
  – looks good!
  – wrong answer: we are not certain of anything
• Factors contributing to our uncertainty
  – small number of test users (n = 6)
  – results are very variable (standard deviation = 32)
    • std. dev. measures dispersal from the mean

Analyzing the Numbers (cont.)

• This is what statistics are for
  – Get a statistics book
  – Landay recommends (for undergrads)
    • The Cartoon Guide to Statistics, Gonick and Smith
• Crank through the procedures and you find
  – 95% certain that the typical value is between 5 & 55 (reproduced in the sketch below)
• Usability test data is quite variable
  – Need many subjects to get good estimates of typical values
  – 4 times as many tests will only narrow the range by 2 times
    • breadth of range depends on the sqrt of the # of test users
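The slide's numbers can be reproduced with a normal-approximation 95% interval, mean ± 1.96·s/√n (with only n = 6, a t-based interval would be even wider, which reinforces the point about variability). A minimal sketch:

```python
from statistics import mean, median, stdev

times = [20, 15, 40, 90, 10, 5]            # task times from the slide
n, m, s = len(times), mean(times), stdev(times)
print(m, median(times), round(s))          # 30  17.5  32

# Normal-approximation 95% interval: mean +/- 1.96 * s / sqrt(n)
half_width = 1.96 * s / n ** 0.5
print(round(m - half_width), round(m + half_width))   # 5  55

# Width shrinks with sqrt(n): 4x as many users halves the interval.
print((1 / 4) ** 0.5)                      # 0.5
```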

Analyzing the Data

• Summarize the data
  – make a list of all critical incidents (CIs)
    • positive: something they liked or that worked well
    • negative: difficulties with the UI
  – include references back to the original data
  – try to judge why each difficulty occurred
• What does the data tell you?
  – Does the visualization work the way you thought it would?
  – Is something missing?

Using the Results

• For usability testing: update the task analysis and rethink the design
  – rate the severity & ease of fixing of the CIs
  – fix both the severe problems & make the easy fixes
• Will thinking aloud give the right answers?
  – not always
  – if you ask a question, people will always give an answer, even if it has nothing to do with the facts
  – try to avoid overly specific questions

Study Good Examples of Experiments!

• Papers in the reader
  – by Byrne, on icons
• Studies done by Shneiderman's students
  – www.otal.umd.edu/Olive/Class

Byrne Icon Study

• Question: do icons facilitate searching for objects in a graphical UI?
  – Do they work better than lists of file names?
  – What characteristics of icons work best?

• A task analysis (of how icons are used) identified three kinds of factors:
  – General factors
  – Visual search factors
  – Semantic search factors
• Twelve factors, all told
• Only a subset will be investigated

• Theoretical Model: Model of Mixed Search
  – Icon search involves two sources of information that may or may not be related
    • the icon picture
    • the textual name associated with it
  – This leads to two kinds of search that get mixed together
    • visual search
    • semantic search
  – The visual characteristics of the icon will partly determine visual search time

• Goals of the experiment are two-fold
  – estimate the effects of several of the factors
  – evaluate the Mixed Search model
• The Mixed Search model depends on timing for visual search, so vary parameters relating to visual search:
  – vary the complexity of the visual form of the icons
  – vary the icon set size
  – color kept constant
• The model also depends on the association of meaning with icons, so vary the type of knowledge needed:
  – file name knowledge
  – picture knowledge

• Method
  – Participants
    • 45 undergrads getting extra credit in a course
    • note this will probably allow for statistical significance
  – Materials
    • instructions and stimuli presented in HyperCard
  – Design of Experiment (a condition-grid sketch follows this list)
    • Complex:
      – one factor evaluated between subjects
        » icon type, one of three levels
      – four factors evaluated within subjects
        » set size of icons
        » match level, varied within sizes
        » amount of a priori picture knowledge
        » amount of a priori filename knowledge
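A mixed design like this is easy to garble when assembled by hand; enumerating the within-subjects cells programmatically keeps the crossing explicit. A minimal sketch covering a subset of the factors; the level values are hypothetical, not Byrne's actual ones:

```python
from itertools import product

icon_types = ["blank", "simple", "complex"]   # between subjects: three levels
set_sizes = [6, 12, 24]                       # within subjects (hypothetical)
picture_knowledge = ["none", "full"]          # within subjects (hypothetical)
filename_knowledge = ["none", "full"]         # within subjects (hypothetical)

# Each participant is assigned one icon type and runs every
# within-subjects cell of the crossed factors.
within_cells = list(product(set_sizes, picture_knowledge, filename_knowledge))
for icon in icon_types:
    print(f"icon type '{icon}': {len(within_cells)} within-subjects cells")
print("first cells:", within_cells[:3])
```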

• Procedures and Stimuli
  – Participants got 3 practice runs and 72 experimental runs
  – Each run had three stages
    • encoding (see the target document)
    • decay (do something else, to distract)
    • search (find the correct icon as quickly as possible)

• Some Results
  – Mean across all conditions was 4.36 seconds, with a standard deviation of 6.61 s
  – Outliers were discarded
  – No overall effect for icon type
  – But interactions occurred
    • Search time was affected by a combination of picture knowledge and icon type
      – participants could only make use of the meaning of the picture if the meaning was displayed in a simple way

• More Results
  – Search time was affected by a combination of filename knowledge and icon type
    • participants did best with a blank icon if they knew the filename!
    • if they did not know the filename, participants did best with the simple icon
  – Simple icons were faster to search overall
    • compared against a baseline of blank icons

• Conclusions
  – For icons to be effective aids in visual search
    • they should be simple
    • they should be easily discriminable from one another
  – simple icons
    • are more effective for large set sizes
    • allow the user to use knowledge of the picture's meaning
    • are less affected by lack of filename knowledge
  – complex icons
    • are worse than blank!
  – Support was found for the mixed-search two-pass processing model

Summary

• User evaluation is important, but takes time & effort
• Early testing can be done on mock-ups (low-fi)
• Use real tasks & representative participants
• Be ethical & treat your participants well
• Goal: learn what people are doing & why
• Doing scientific experiments requires more users to get statistically reliable results