1 lecture 19 chapter 9 evaluation techniques (part 2)
TRANSCRIPT
![Page 1: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/1.jpg)
1
Lecture 19
chapter 9evaluation techniques
(part 2)
![Page 2: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/2.jpg)
2
Evaluating Implementations
Requires an artefact:a simulation, a prototype, or afull implementation
![Page 3: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/3.jpg)
3
Experimental evaluation
• controlled evaluation of specific aspects of interactive behaviour
• evaluator chooses hypothesis to be tested
• a number of experimental conditions are considered which differ only in the value of some controlled variable.
• changes in behavioural measure are attributed to different conditions
![Page 4: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/4.jpg)
4
Experimental factors
• Subjects (i.e., the users, aka ‘participants’)– who – representative, sufficient sample
• not the programmer’s friend, boss, etc.• huge variability in performance of individuals
• Variables– things to modify and measure
• Hypothesis– what you’d like to show
• Experimental design– how you are going to show it– Includes ‘Protocol’ – what the subjects do
![Page 5: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/5.jpg)
5
Variables
• independent variable (IV) characteristic changed to produce different
conditions e.g. interface style, number of menu items
• dependent variable (DV) characteristics measured in the experiment e.g. time taken, number of errors.
![Page 6: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/6.jpg)
6
Hypothesis
• prediction of outcome– framed in terms of IV and DV
e.g. “error rate will increase as font size decreases”
• null hypothesis:– states no difference between conditions– aim is to disprove this
e.g. null hyp. = “no change with font size”
![Page 7: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/7.jpg)
7
Experimental design
• “within groups” design (also called “repeated measures”)– each subject performs experiment under each condition– transfer of learning possible (practice makes
performance better; or alternatively fatigue or boredom makes it worse)
– less costly and less likely to suffer from user variation (each user is compared to themselves)
• between groups design– each subject performs under only one condition– no transfer of learning – more users required
![Page 8: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/8.jpg)
8
Within v. Between
• Consider a test on the difference of beer v. vodka martinis on reaction time– Null hypothesis – no difference in increase in reaction
time between the two beverages
• Design 1:– 30 people try beer; 30 other people try vodka – D.V.
is change in reaction time pre- v. post drinking• Not bad – be sure to randomize who goes into beer
group v. vodka group• But ‘power’ of the experiment will be reduced due to
the great variability of individuals in reaction to alcohol
![Page 9: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/9.jpg)
9
Within v. Between (contd.)
• Design 2:– All 60 people first try beer, then immediately try
vodka• Problem of carryover effect
• Better Design:– All 60 try beer, then a week later try vodka
• Now each individual is compared with themselves• Still possible problem of ordering effect (e.g., they
might get a little better at the reaction time test)
• Best Design:– 30 try beer, then a week later vodka; 30 try vodka
and then a week later beer
![Page 10: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/10.jpg)
10
Analysis of data
• Before you start to do any statistics:– look at data (e.g. average=5.25 – but 4.9 without outlier)– save original data
• Choice of statistical technique depends on– type of data– information required
• Type of data– discrete
• finite number of values• may be ordered, or
unordered (e.g., colors)– continuous
• any value0
2
4
6
8
10
12
14
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
![Page 11: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/11.jpg)
11
Analysis - types of test
• parametric– assume normal distribution– robust– powerful
• non-parametric– do not assume normal distribution– less powerful– more reliable
• contingency table– classify data by discrete attributes – count number of data items in each group
![Page 12: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/12.jpg)
12
Analysis of data (cont.)
• What information is required?– is there a difference?– how big is the difference?– how accurate is the estimate?
• Parametric and non-parametric tests mainly address first of these
![Page 13: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/13.jpg)
13
ANOVA – analysis of variance
• Quite easy to test whether there’s a significant difference between groups in Excel– Need to invoke
Tools/Add-ins/Analysis Toolpack to enable– Then just apply Tools/Data Analysis/ANOVA:
Single Factor to the data
![Page 14: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/14.jpg)
14
ANOVA from Excel
Anova: Single Factor
SUMMARYGroups Count Sum Average Variance
Group 1 5 82 16.4 9.3Group 2 5 67 13.4 10.3Group 3 7 79 11.28571 3.904762
ANOVASource of Variation SS df MS F P-value F critBetween Groups 76.28908 2 38.14454 5.244339 0.019959 3.738892Within Groups 101.8286 14 7.273469
Total 178.1176 16
If P-value < 0.05 then we usually say the result is ‘significant’ (result is more than expected chance variation)
Say we have three columns of numbers representing the time to complete a task for 5, 5 and 7 users using three variations of an interface
![Page 15: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/15.jpg)
15
When is a difference a difference?
• In the world of parametric stats, we look for a statistic to be large enough to be ‘significant’– On the Gaussian (‘normal’) curve a |Z|=1.96
leaves 95% of the area of thecurve behind so is acommon ‘criticalvalue’ forclaimingsignificance
![Page 16: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/16.jpg)
16
Parametric assumptions• Parametric statistics assume that
some mathematically elegant assumptions hold true for the data– E.g., ANOVA (and standard
‘regression’) assume, among other things, normally distributed random error
– Trivia: The mathematical form of the probability density function for the normal distribution is remarkably formidable
• Centres on mean, , and is flattened by standard deviation,
Galton machine simulates normal distribution (aka ‘bell curve’)
Exponential distribution models time between events happening with a constant average rate
![Page 17: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/17.jpg)
17
So, what to measure?
• Usually one (or several) of these things:– Speed / efficiency
• How many (or whatever) per unit time can the user process with this interface?
– Accuracy (errors)– Learnability (time to acquire the ability to do
something in particular with the interface)• And retention (how well they can do it some particular
time later)
– Satisfaction• Subjective assessment – how does the user feel about
the interface
![Page 18: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/18.jpg)
18
Subjective assessment
• Very important– If the user seriously doesn’t like it, probably there’s
something really wrong with the design (just maybe they can’t articulate what)
• Quantifiable– Yes/No, or better, Likert scale (ratings) through
interview or questionnaire• Need to ask the right questions
– E.g., don’t have leading questions (BAD: “Is this the absolute worst system you have used ever?”)
• And need to ask the questions well (so user reliably expresses what they mean)– E.g., don’t trip them up with double negatives
![Page 19: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/19.jpg)
19
Likert Scale
• Can be from 4 to 7 “points”• Usually about agreement to a phrase
– E.g., “I found the search function easy to use”– Strongly Agree, Agree, Neutral (optional), Disagree,
Strongly Disagree• May also be about importance
– E.g., “A site search function is…”– “not very important” to “extremely important”
• Or a general assessment– E.g., “The performance of the search function was…”– Poor, Fair, Satisfactory, Good, Excellent
• Great to ask open-ended questions, too– E.g., “What was the best aspect of the search function?”– But it’s the Likert scale data that you can quantify
![Page 20: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/20.jpg)
20
Dimensions and validity
• When designing a questionnaire…– Have in mind a few underlying issues that you are
trying to assess– Ask a few different questions that are coordinated
around each issue– Ask different ways – vary whether positive or
negative favours the issue in question– Ideally, verify the questionnaire by having people
role-play particularly happy or angry, and middle-of-the-road users
• And see if they answer the questions the way you’d expect!
![Page 21: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/21.jpg)
21
Questionnaire administration
• Face-to-face or telephone interviews– Esp. efficient (and unbiased) to have third party give
the telephone survey
• Mail-out (including email) questionnaire• Web-based questionnaire (maybe email out
URL)• Even a questionnaire is an experiment on
humans from the point of view of research ethics– Easier to achieve an ethical questionnaire
administration if it’s truly anonymous
![Page 22: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/22.jpg)
22
Response rates and bias
• Low response rate is a problem– Below 50% response rate, one wonders whether
respondents were exceptional (happiest, angriest or a mix; but not the “normal” folks)
– Better if you have some authority to motivate a response
• But back to issues of ethics – e.g., not truly anonymous if you know who to nag about non-response; and the “pressure” may be unfair
• Similar bias problems when you use volunteers for any experiment– Are these volunteers representative of your “normal”
users?
![Page 23: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/23.jpg)
23
Experimental studies on groupsMore difficult than single-user experiments
Problems with:– subject groups– choice of task– data gathering– Analysis
• Unfortunately (in terms of experimental requirements) a lot of things that are interesting in the real world, involve computers mediating group behaviour
![Page 24: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/24.jpg)
24
Subject groups
larger number of subjects more expensive
longer time to `settle down’… even more variation!
difficult to timetable
so … often only three or four groups
![Page 25: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/25.jpg)
25
Groups (contd.): Data gathering
several video cameras+ direct logging of application
problems:– synchronisation– sheer volume!
one solution:– record from each perspective
![Page 26: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/26.jpg)
26
Groups (contd.): Analysis
N.B. vast variation between groups
solutions:– ‘within groups’ experiments (each group works under
various conditions)– micro-analysis (e.g., gaps in speech)– anecdotal and qualitative analysis
controlled experiments may `waste' resources!– Experiments dominated by dynamics of group formation– Field studies are apt to be more realistic
![Page 27: 1 Lecture 19 chapter 9 evaluation techniques (part 2)](https://reader038.vdocuments.us/reader038/viewer/2022103100/56649f1d5503460f94c3495d/html5/thumbnails/27.jpg)
27
About statistics
• It’s an amazingly complex field– A lot of hidden complexities in running experiments and
saying that the observed differences really make a difference
• ‘threats to validity’ – are those things that make it possible that your experimental conclusion is in error
– Threats to internal validity: like carryover effects, or lack of randomization
– Threats to external validity: like that your whole population of subjects were unusual in some way, or the task was not representative of real use of the tool
– When the outcomes are serious (e.g., medical trials) professional statisticians are always used in design of the experiment as well as analysis and reporting of the findings
– Plenty of texts and courses on stats available (the Wikipedia is pretty good on this topics, too – e.g., for ANOVA)