User Testing & Automated Evaluation

Prof. James A. Landay, University of Washington, Autumn 2012
User Interface Design + Prototyping + Evaluation



TRANSCRIPT

User Testing & Automated Evaluation

Prof. James A. Landay
University of Washington
Autumn 2012
User Interface Design + Prototyping + Evaluation

(Instructor note: took 1:12 and did the course eval at the end; sped up a bit at the end and cut some redundant slides.)

Product Hall of Fame or Shame?
Apple One Button Mouse


Product Hall of Shame!

- How to hold this? No tactile clue that you were holding the mouse in the correct orientation
- Later designs added a dimple in the button, yet it remained ergonomically difficult to use

(Instructor note: took 1:20 for the main part and 1:50 including the automated usability section added at the end.)

Outline
- Visual design review
- Why do user testing?
- Choosing participants
- Designing the test
- Collecting data
- Analyzing the data
- Automated evaluation

Visual Design Review
- Grid systems help us put information on the page in a logical manner: similar things close together
- Small changes help us see key differences (e.g., small multiples)
- RGB color space leads to bad colors
- Use color properly, not for ordering!
- Avoid clutter: remove until you can remove no more

Why Do User Testing?
- Can't tell how good a UI is until people use it!

- Expert review methods are based on evaluators who
  - may know too much
  - may not know enough (about tasks, etc.)

- Hard to predict what real users will do

Choosing Participants
- Representative of target users: job-specific vocabulary / knowledge, tasks
- Approximate if needed
  - system intended for doctors? get medical students or nurses
  - system intended for engineers? get engineering students
- Use incentives to get participants

Ethical Considerations
- Usability tests can be distressing: users have left in tears

- You have a responsibility to alleviate this
  - make participation voluntary, with informed consent (form)
  - avoid pressure to participate
  - let them know they can stop at any time
  - stress that you are testing the system, not them
  - make the collected data as anonymous as possible
- Often must get human subjects approval

User Test Proposal
- A report that contains
  - objective
  - description of the system being tested
  - task environment & materials
  - participants
  - methodology
  - tasks
  - test measures

- Get it approved & then reuse it for the final report
- Seems tedious, but writing this will help debug your test

Selecting Tasks
- Should reflect what real tasks will be like
- Tasks from your analysis & design can be used
  - may need to shorten them if they take too long or require background the test user won't have

- Try not to train unless that will happen in the real deployment
- Avoid bending tasks in the direction of what your design best supports
- Don't choose tasks that are too fragmented (e.g., the phone-in bank test)

Two Types of Data to Collect
- Process data: observations of what users are doing & thinking

- Bottom-line data: a summary of what happened (time, errors, success), i.e., the dependent variables


Which Type of Data to Collect?
- Focus on process data first: it gives a good overview of where the problems are
- Bottom-line data doesn't tell you where to fix; it just says: too slow, too many errors, etc.
- Hard to get reliable bottom-line results: need many users for statistical significance

The Thinking Aloud Method
- Need to know what users are thinking, not just what they are doing
- Ask users to talk while performing tasks
  - tell us what they are thinking
  - tell us what they are trying to do
  - tell us questions that arise as they work
  - tell us things they read
- Make a recording or take good notes: make sure you can tell what they were doing

Thinking Aloud (cont.)
- Prompt the user to keep talking: "tell me what you are thinking"

- Only help on things you have pre-decided to help with; keep track of anything you do give help on

- Recording
  - use a digital watch/clock
  - take notes, plus, if possible, record audio & video (or even event logs; a minimal sketch follows)
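One way to capture the "event logs" mentioned above is to time-stamp observable actions so they can be lined up with the audio/video later. A minimal sketch, assuming a small Python script; the file name and event names are hypothetical.

```python
# Minimal sketch (not from the lecture): time-stamp each observable action
# during a session so notes and recordings can be synced later.
import csv
import time

def log_event(writer, event, detail=""):
    # Wall-clock timestamp so the log lines up with the audio/video recording.
    writer.writerow([time.strftime("%H:%M:%S"), event, detail])

# Hypothetical file name and events for one participant's session.
with open("session_p1_events.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time", "event", "detail"])
    log_event(writer, "task_start", "Task 1: find last month's statement")
    log_event(writer, "help_given", "clarified task wording only")
    log_event(writer, "task_end", "completed")
```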


Using the Test Results
- Summarize the data
  - make a list of all critical incidents (CIs), positive & negative
  - include references back to the original data
  - try to judge why each difficulty occurred

- What does the data tell you?
  - Did the UI work the way you thought it would?
  - Did users take the approaches you expected?
  - Is something missing?

Using the Results (cont.)
- Update the task analysis & rethink the design
- Rate the severity & ease of fixing the CIs
- Fix both the severe problems & make the easy fixes (one way to keep track of them is sketched below)
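As a rough illustration of rating and prioritizing critical incidents, here is a small sketch; the incident records, rating scales, and field names are invented for the example, not part of the lecture.

```python
# Illustrative sketch: keep each critical incident with a pointer back to the
# original data plus severity and ease-of-fix ratings, then sort so severe
# problems and easy fixes come first. All records below are invented.
from dataclasses import dataclass

@dataclass
class CriticalIncident:
    description: str
    source: str        # reference back to the original data (tape time, note #)
    positive: bool
    severity: int      # 1 (cosmetic) .. 4 (blocks the task)
    ease_of_fix: int   # 1 (trivial) .. 4 (major redesign)

incidents = [
    CriticalIncident("Missed the Save button entirely", "P3 video 12:40", False, 4, 2),
    CriticalIncident("Liked the undo confirmation", "P1 notes #7", True, 1, 1),
    CriticalIncident("Misread the date format", "P5 video 03:15", False, 2, 1),
]

# Severe problems first, then whatever is cheapest to fix.
for ci in sorted(incidents, key=lambda c: (-c.severity, c.ease_of_fix)):
    print(ci.severity, ci.ease_of_fix, ci.description)
```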

Will Thinking Aloud Give the Right Answers?
- Not always
- If you ask a question, people will always give an answer, even if it has nothing to do with the facts (the panty hose example)

- Try to avoid specific questions

Analyzing the Numbers
- Example: trying to get a task time of 30 min.; the test gives: 20, 15, 40, 90, 10, 5
  - mean (average) = 30
  - median (middle) = 17.5
  - looks good! Did we achieve our goal?
- Wrong answer: we are not certain of anything!
- Factors contributing to our uncertainty
  - small number of test users (n = 6)
  - results are very variable (standard deviation = 32)
  - std. dev. measures dispersal around the mean
(The arithmetic is reproduced below.)
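The slide's numbers can be reproduced with the Python standard library; only the six task times come from the slide, the rest is ordinary descriptive statistics.

```python
# Reproducing the slide's arithmetic with the standard library only.
import math
import statistics

times = [20, 15, 40, 90, 10, 5]        # task times in minutes from the slide, n = 6

mean = statistics.mean(times)           # 30.0
median = statistics.median(times)       # 17.5
stdev = statistics.stdev(times)         # ~31.8 (sample standard deviation)
sem = stdev / math.sqrt(len(times))     # standard error of the mean, ~13.0

print(f"mean={mean}, median={median}, stdev={stdev:.1f}, sem={sem:.1f}")
# mean +/- 2*standard error spans roughly 4 to 56 minutes, which is why a
# 30-minute average from 6 users proves very little on its own.
```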

Measuring Bottom-Line Usability
- Situations in which numbers are useful
  - time requirements for task completion
  - successful task completion %
  - comparing two designs on speed or # of errors
- Ease of measurement
  - time is easy to record
  - error or successful completion is harder: define in advance what these mean (see the sketch below)
- Do not combine with thinking aloud. Why? Talking can affect speed & accuracy

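A small sketch of computing such bottom-line measures; the error counts, the finished/not-finished flags, and the 30-minute success criterion are hypothetical, chosen only to show that success and error are defined before the test.

```python
# Illustrative sketch: bottom-line measures from per-participant records.
from statistics import mean

# (participant, completion time in minutes, error count, finished the task?)
sessions = [
    ("P1", 20, 1, True),
    ("P2", 15, 0, True),
    ("P3", 40, 3, True),
    ("P4", 90, 7, False),   # did not finish within the session
    ("P5", 10, 0, True),
    ("P6", 5, 0, True),
]

SUCCESS_LIMIT = 30  # minutes; the success criterion is decided in advance

completed = [s for s in sessions if s[3]]
successes = [s for s in completed if s[1] <= SUCCESS_LIMIT]

print(f"completion rate: {len(completed) / len(sessions):.0%}")
print(f"success rate (done within {SUCCESS_LIMIT} min): {len(successes) / len(sessions):.0%}")
print(f"mean errors: {mean(s[2] for s in sessions):.1f}")
```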

Analyzing the Numbers (cont.)
- This is what statistics is for

- Crank through the procedures and you find: 95% certain that the typical value is between 5 & 55


- Usability test data is quite variable: you need lots of it to get good estimates of typical values
  - 4 times as many tests will only narrow the range by 2x
  - the breadth of the range depends on the square root of the # of test users (see the sketch below)
  - this is when online methods become useful: easy to test with large numbers of users
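A quick sketch of that square-root relationship, assuming the standard deviation stays roughly the same as more users are tested:

```python
# Why 4x as many test users only halves the uncertainty: the standard error
# shrinks with the square root of the sample size.
import math

stdev = 31.8            # sample standard deviation from the 6-user example

for n in (6, 24, 96):   # 6 users, then 4x, then 16x
    sem = stdev / math.sqrt(n)
    print(f"n={n:3d}: mean +/- 2*SE spans about {4 * sem:.0f} minutes")
# n=6 -> ~52 minutes wide, n=24 -> ~26, n=96 -> ~13: each 4x increase in
# participants halves the width of the range.
```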

Measuring User Preference
- How much users like or dislike the system
  - can ask them to rate it on a scale of 1 to 10
  - or have them choose among statements: "best UI I've ever used", "better than average", ...
  - hard to be sure what the data will mean: novelty of the UI, feelings, not a realistic setting
- If many give you low ratings: trouble
- Can get some useful data by asking what they liked, disliked, where they had trouble, best part, worst part, etc.
  - redundant questions are OK

Comparing Two Alternatives
- Between-groups experiment

  - two groups of test users
  - each group uses only 1 of the systems

- Within-groups experiment

  - one group of test users
  - each person uses both systems
  - can't use the same tasks or order (learning effects)
  - best for low-level interaction techniques

Comparing Two Alternatives
- Between groups requires many more participants than within groups

- See if the differences are statistically significant
  - assumes a normal distribution & the same std. dev. (a significance-test sketch follows this list)

- Online companies can do large A/B tests
  - look at the resulting behavior (e.g., did they buy?)
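As a rough illustration of the significance test, here is a between-groups comparison using SciPy's independent-samples t-test, which matches the slide's assumptions (roughly normal data, similar standard deviations). The group B times are invented; group A reuses the earlier example data.

```python
# Illustrative between-groups comparison; design_b times are made up.
from scipy import stats

design_a = [20, 15, 40, 90, 10, 5]   # task times (min), group using design A
design_b = [12, 18, 25, 30, 14, 9]   # task times (min), group using design B (hypothetical)

result = stats.ttest_ind(design_a, design_b)   # equal variances assumed by default
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
# A p-value below 0.05 is conventionally read as a significant difference;
# with samples this small and this variable, it usually is not.
```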

Experimental Details
- Order of tasks
  - choose one simple order (simple to complex), unless doing a within-groups experiment (one way to counterbalance order is sketched below)
- Training: depends on how the real system will be used
- What if someone doesn't finish?
  - assign a very large time & a large # of errors, or remove them & note it
- Pilot study
  - helps you fix problems with the study
  - do two: first with colleagues, then with real users
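For within-groups studies, a common way to handle the ordering problem is to counterbalance which system each participant sees first. A minimal sketch; the participant and system names are placeholders.

```python
# Illustrative counterbalancing of presentation order for a within-groups test.
from itertools import cycle

participants = ["P1", "P2", "P3", "P4", "P5", "P6"]
orders = [("A", "B"), ("B", "A")]        # alternate which system comes first

for participant, order in zip(participants, cycle(orders)):
    print(participant, "uses", " then ".join(order))
# P1 uses A then B, P2 uses B then A, P3 uses A then B, ...
```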

Instructions to Participants
- Describe the purpose of the evaluation: "I'm testing the product; I'm not testing you"
- Tell them they can quit at any time
- Demonstrate the equipment
- Explain how to think aloud
- Explain that you will not provide help
- Describe the task
  - give written instructions, one task at a time

Details (cont.)
- Keeping variability down
  - recruit test users with similar backgrounds
  - brief users to bring them to a common level
  - perform the test the same way every time
  - don't help some more than others (plan this in advance)
  - make the instructions clear

- Debriefing test users
  - they often don't remember, so demonstrate or show video segments
  - ask for comments on specific features
  - show them the screens (online or on paper)

Reporting the Results
- Report what you did & what happened
- Images & graphs help people get it!
- Video clips can be quite convincing

AUTOMATED & REMOTE USABILITY EVALUATION

Automated Analysis & Remote Testing
- Log analysis: infer user behavior by looking at web server logs

- A/B testing
  - show different user segments different designs
  - requires a live site (built) & a customer base
  - measures outcomes (profit), but not why

- Remote user testing: similar to in-lab testing, but online (e.g., over Skype)

Web Log Analysis Is Difficult

- Logs just tell you visitors' IP addresses and what resource they are accessing (clicked on a link, etc.); a toy parsing sketch follows after this list

- Google Analytics & other tools try to make this analysis more user friendly
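A toy sketch of what raw log analysis looks like: per-IP page sequences and nothing more. It assumes a file in the common Apache/NGINX "combined" format; the file name is hypothetical.

```python
# Illustrative sketch: group requested pages by visitor IP from a web server log.
import re
from collections import defaultdict

LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?:GET|POST) (?P<path>\S+)')

pages_by_visitor = defaultdict(list)
with open("access.log") as log:           # hypothetical log file
    for line in log:
        m = LINE.match(line)
        if m:
            pages_by_visitor[m.group("ip")].append(m.group("path"))

for ip, pages in pages_by_visitor.items():
    print(ip, "->", " -> ".join(pages))
# You learn which pages an IP address touched, but not who the visitor was,
# what they wanted, or whether they found it.
```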

Google Analytics: Server Logs++
(Screenshots from http://www.redflymarketing.com/blog/using-google-analytics-to-improve-conversions/)


(Instructor note: the simplest example of data driving innovation is on the web. Even the WSJ has picked this up, though it has been going on for at least 5-10 years.)

The Web Allows Controlled A/B Experiments
- Example: Amazon shopping cart
  - add an item to the cart; the site shows the cart contents
  - idea: show recommendations based on the cart items
- Arguments
  - Pro: cross-sell more items
  - Con: distract people at checkout
- Highest Paid Person's Opinion: "Stop the project!"
- A simple experiment was run; it was wildly successful
(From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html)


The rest is history: recommendations are now standard on many sites.

Windows Marketplace: Solitaire vs. Poker

- B: Poker game
- A: Solitaire game

- Which image has the higher clickthrough? By how much?
- A is 61% better. Why?
(Courtesy of Ronny Kohavi)
(Instructor note: How many think Poker? How many think Solitaire? B is 38% worse, or equivalently A is 61% better. Why? A rough sketch of that arithmetic follows.)
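A back-of-the-envelope sketch of that arithmetic; the impression and click counts are invented, and only the 38%/61% relationship comes from the slide.

```python
# Why "B is 38% worse" and "A is 61% better" describe the same gap: it depends
# on which variant is treated as the baseline. Counts below are hypothetical.
clicks_a, impressions_a = 161, 10_000    # Solitaire creative (hypothetical)
clicks_b, impressions_b = 100, 10_000    # Poker creative (hypothetical)

ctr_a = clicks_a / impressions_a
ctr_b = clicks_b / impressions_b

print(f"B relative to A: {ctr_b / ctr_a - 1:+.0%}")   # about -38%
print(f"A relative to B: {ctr_a / ctr_b - 1:+.0%}")   # about +61%
# 1 / 0.62 is roughly 1.61, so the two statements are equivalent.
```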

(Instructor note: this is easy to do on the web. But how do we do it for other products? The products of the future will be more a combination of hardware and software, adaptive to environment/context. And how do we know what product to make? And how do we find out why they preferred one over the other?)

The Trouble With Most Web Site Analysis Tools
- Unknowns: Who? What? Why? Did they find it? Satisfied?

  - Who are they? What are they looking for? Why did they go where they went? Did they find what they were looking for? Did they leave satisfied?

NetRaker Usability Research
- See how customers accomplish real tasks on the site


UserZoom
- See how customers accomplish real tasks on the site

Advantages of Remote Usability Testing
- Fast
  - can set up research in 3-4 hours
  - get results in 36 hours
- More accurate
  - can run with large samples (50-200 users gives statistical significance)
  - uses real people (customers) performing tasks
  - natural environment (home/work/own machine)
- Easy to use: templates make setting up easy
- Can compare with competitors: indexed to national norms

Disadvantages of Remote Usability Testing
- Miss observational feedback
  - facial expressions
  - verbal feedback (critical incidents)

- Need to involve human participants: costs some money (typically $20-$50/person)

- People often do not like pop-ups: be careful when using them

Summary
- User testing is important, but takes time/effort
- Early testing can be done on mock-ups (low-fi)
- Use real tasks & representative participants
- Be ethical & treat your participants well
- Want to know what people are doing & why? Collect process data
- Bottom-line data requires more participants to get statistically reliable results
- Difference between between- & within-groups designs?
  - between groups: everyone participates in one condition
  - within groups: everyone participates in multiple conditions
- Automated usability evaluation
  - faster than traditional techniques
  - can involve more participants, giving convincing data
  - easier to do comparisons across sites
  - tradeoff: you lose observational data

Next Time
- Interactive Prototype Presentations

Sheet1: Web Usability Test Results

Participant #   Time (minutes)
1               20
2               15
3               40
4               90
5               10
6               5

number of participants: 6
mean: 30.0
median: 17.5
std dev: 31.8
standard error of the mean = std dev / sqrt(# samples): 13.0
typical values will be mean +/- 2 * standard error --> 4 to 56!
what is plausible? = CONFIDENCE(alpha = 5%, std dev, sample size) = 25.4 --> 95% confident between 5 & 56
