![Page 1: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/1.jpg)
User Performance versus Precision Measures for Simple Search Tasks
(Don’t bother improving MAP)
Andrew Turpin
Falk Scholer
{aht,fscholer}@cs.rmit.edu.au
![Page 2: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/2.jpg)
People in glass houses should not throw stones
http://www.hartley-botanic.co.uk/hartley_images/victorian_range/victorian_range_09.jpg
![Page 3: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/3.jpg)
Scientists should not live in glass houses. Nor straw, nor wood…
http://www-math.uni-paderborn.de/~odenbach/pics/pigs/pig2.jpg
![Page 4: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/4.jpg)
Scientists should do more than throw stones
www.worth1000.com/entries/161000/161483INPM_w.jpg
![Page 5: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/5.jpg)
Overview
• How are IR systems compared?
  – Mean Average Precision: MAP
• Do metrics match user experience?
  – First grain (Turpin & Hersh SIGIR 2000)
  – Second pebble (Turpin & Hersh SIGIR 2001)
  – Third stone (Allan et al SIGIR 2005)
  – This golf ball (Turpin & Scholer SIGIR 2006)
![Page 6: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/6.jpg)
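Example: two ranked result lists (1 = relevant, 0 = not relevant), with precision computed at each rank

| Measure | List 1 | List 2 |
|---|---|---|
| P@1 | 0/1 = 0.00 | 0/1 = 0.00 |
| P@5 | 1/5 = 0.20 | 2/5 = 0.40 |
| AP (Av. of P at 1’s) | 0.25 | 0.54 |

Precision at the relevant documents: 0.25 (List 1); 0.67 and 0.40 (List 2)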
![Page 7: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/7.jpg)
AP = (Sum of all precision values at relevant documents) / (Number of relevant docs in the list)
  (0.25) / 1 = 0.25
  (0.67 + 0.40) / 2 = 0.54

AP = (Sum of all precision values at relevant documents) / (Number of relevant docs in all lists)
  (0.25) / 3 = 0.08
  (0.67 + 0.40) / 3 = 0.36
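The two AP definitions can be sketched in a few lines of Python. This is only an illustration: the function names and the relevance list `example` are assumptions, with the list chosen so that it reproduces the 0.25 and 0.08 values on the slide (one relevant document, at rank 4).

```python
from typing import Optional, Sequence

def precision_at_k(rels: Sequence[int], k: int) -> float:
    """Fraction of the top-k results that are relevant (1 = relevant, 0 = not)."""
    return sum(rels[:k]) / k

def average_precision(rels: Sequence[int], total_relevant: Optional[int] = None) -> float:
    """Sum of precision values at relevant ranks, divided by a relevant-document count.

    By default the divisor is the number of relevant documents in this list;
    pass total_relevant to divide by the number of relevant documents overall.
    """
    precisions = [precision_at_k(rels, rank)
                  for rank, rel in enumerate(rels, start=1) if rel]
    divisor = len(precisions) if total_relevant is None else total_relevant
    return sum(precisions) / divisor if divisor else 0.0

# Hypothetical ranked list: a single relevant document at rank 4.
example = [0, 0, 0, 1, 0, 0]

print(precision_at_k(example, 1))                              # 0.0
print(precision_at_k(example, 5))                              # 0.2
print(average_precision(example))                              # 0.25
print(round(average_precision(example, total_relevant=3), 2))  # 0.08
```

Dividing by the number of relevant documents across all lists penalises a system for relevant documents it never retrieved, which is why the second definition gives lower values.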
![Page 8: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/8.jpg)
Mean Average Precision (MAP)
• Previous example showed precision for one query
• Ideally need many queries (50 or more)
• Take the mean of the AP values over all queries: MAP
• Do a paired t-test, Wilcoxon, Tukey HSD, …
• Compares systems on the same collection and same queries
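As a rough sketch of that procedure, the following computes MAP for two systems and the paired t statistic over per-query AP differences. The AP values are made up for illustration (and far fewer than the 50 queries suggested), and the function names are assumptions. Only the test statistic is computed here; the p-value would come from a t distribution with n - 1 degrees of freedom, typically through a statistics library.

```python
import math
import statistics

def mean_average_precision(ap_values):
    """MAP: the mean of the per-query AP values."""
    return statistics.mean(ap_values)

def paired_t_statistic(ap_a, ap_b):
    """Paired t statistic and degrees of freedom for per-query AP differences (B minus A)."""
    diffs = [b - a for a, b in zip(ap_a, ap_b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Hypothetical per-query AP values for two systems on the same five queries.
ap_system_a = [0.25, 0.40, 0.10, 0.55, 0.30]
ap_system_b = [0.30, 0.45, 0.20, 0.50, 0.40]

map_a = mean_average_precision(ap_system_a)
map_b = mean_average_precision(ap_system_b)
t_stat, df = paired_t_statistic(ap_system_a, ap_system_b)
print(f"MAP A = {map_a:.3f}, MAP B = {map_b:.3f}, t = {t_stat:.3f}, df = {df}")
```

The pairing matters: both systems are scored on the same queries, so the test works on per-query differences rather than on two independent samples.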
![Page 9: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/9.jpg)
Typical IR empirical systems paper

| Similarity Measure | Simple Terms | Simple Terms + Phrases | Percentage Improvement | Significance |
|---|---|---|---|---|
| Lnu.ltu | 0.3616 | 0.3758 | 3.9% | unknown |
| BBA-AGJ-BCA | 0.3497 | 0.3683 | 5.1% | p=0.006 |
| BDA-CI-BCA | 0.3373 | 0.3586 | 5.9% | p=0.006 |

Turpin & Moffat SIGIR 1999
![Page 10: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/10.jpg)
Fang et al SIGIR 2004
Monz et al SIGIR 2005
Shi et al SIGIR 2005
Jordan et al JCDL June 2006
![Page 11: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/11.jpg)
Implicit assumption
More relevant documents high in the list is good
• Do users generally want more than one relevant document?
• Do users read lists top to bottom?
• Who determines relevance? Binary? Conditional or state-based?
• While MAP is tractable, does it reflect user experience?
• Is Yahoo! really better than Google, or vice-versa?
![Page 12: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/12.jpg)
General Experiment
• Get a collection, set of queries, relevance judgments
• Compare System A and System B using MAP (Cranfield)
• Get users to do queries with System A or System B (balanced design…)
• Did the users do better with A or B?
• Did the users prefer A or B?
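One simple way to balance such a design is to rotate the systems across users and queries, so that every user and every query sees each system equally often. The sketch below is an assumption about what a balanced assignment could look like, not the design used in the experiments; the function name and the user and query counts are hypothetical.

```python
from collections import Counter

def assign_systems(num_users, num_queries, systems=("A", "B")):
    """Rotate systems over (user, query) pairs so each system is used equally often."""
    return {
        (user, query): systems[(user + query) % len(systems)]
        for user in range(num_users)
        for query in range(num_queries)
    }

# Hypothetical small study: 4 users, 6 queries, two systems.
plan = assign_systems(num_users=4, num_queries=6)
per_system = Counter(plan.values())

for user in range(4):
    print(user, [plan[(user, query)] for query in range(6)])
```

A real study would usually also randomise query order and user allocation; the rotation only guarantees equal counts.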
![Page 13: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/13.jpg)
Experiment 2000
24 Users, 6 Queries
Engine A: MAP 0.275, IR 0.330
Engine B: MAP 0.324, IR 0.390
![Page 14: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/14.jpg)
Experiment 2001
32 Users, 8 Queries
Engine A: MAP 0.270, QA 66%
Engine B: MAP 0.354, QA 60%
![Page 15: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/15.jpg)
Experiment 2005
• James Allan et al, UMass, SIGIR2005
• Passage retrieval and a recall task
• Used bpref, which “tracks MAP”
• Small benefit to users when bpref goes from 0.50 to 0.60 and from 0.90 to 0.95
• No benefit in the mid range 0.60 to 0.90
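For readers unfamiliar with bpref, here is a minimal sketch of the measure as it is commonly defined: each retrieved relevant document is penalised by the fraction of judged non-relevant documents ranked above it, and unjudged documents are ignored. Whether Allan et al used exactly this variant is not stated on the slide, so the function and the example list are assumptions for illustration only.

```python
def bpref(judgments, total_relevant):
    """bpref over a ranked list of judgments.

    judgments: 1 = judged relevant, 0 = judged non-relevant, None = unjudged.
    total_relevant: R, the number of judged relevant documents for the query.
    Only the first R judged non-relevant documents count against a relevant one.
    """
    if total_relevant == 0:
        return 0.0
    nonrelevant_seen = 0
    score = 0.0
    for judgment in judgments:
        if judgment is None:
            continue  # unjudged documents neither help nor hurt
        if judgment == 1:
            score += 1 - min(nonrelevant_seen, total_relevant) / total_relevant
        else:
            nonrelevant_seen += 1
    return score / total_relevant

# Hypothetical ranked list with one unjudged document at rank 3.
ranked = [1, 0, None, 1, 0, 1]
print(round(bpref(ranked, total_relevant=3), 3))  # 0.667
```

Because unjudged documents are skipped, bpref is less sensitive than MAP to incomplete relevance judgments.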
![Page 16: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/16.jpg)
Experiments 2000, 2001, 2005

| | Predicted (MAP) | Actual |
|---|---|---|
| Instance recall | 81% | 15% (p = 0.27) |
| Question answering | 58% | -6% (p = 0.41) |
| Exp 2005 | 20% | 20% |
| | 16% | 1% |
| | 50% | 0% |
Exp 2001
Exp 2002
![Page 17: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/17.jpg)
Experiment 2006
32 Users, 50 Queries (100 documents)
A: MAP 0.55
B: MAP 0.65
C: MAP 0.75
D: MAP 0.85
E: MAP 0.95
![Page 18: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/18.jpg)
Our Sheep
![Page 19: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/19.jpg)
![Page 20: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/20.jpg)
[Figure: Time required to find first relevant document. x-axis: MAP (0.55, 0.65, 0.75, 0.85, 0.95); y-axis: Time (seconds), 0 to 300]
![Page 21: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/21.jpg)
[Figure: Failures. x-axis: MAP (55%, 65%, 75%, 85%, 95%); y-axis: % of queries with no relevant answer, 0 to 25]
![Page 22: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/22.jpg)
“Better” MAP definition
![Page 23: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/23.jpg)
Conclusion
• MAP does allow us to compare IR systems, but the assumption that an increase in MAP translates into an increase in user performance or satisfaction is not true
  – Supported by 4 different experiments
• Don’t automatically choose MAP as a metric
  – P@1 for Web style tasks?
![Page 24: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/24.jpg)
[Figure: x-axis: P@1 (0, 1); y-axis: Time (seconds), 0 to 300]
![Page 25: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/25.jpg)
[Figure legend: 0-10%, 10-20%, 20-30%, 30-40%, 40-50%, 50-60%, 60-70%, 70-80%, 80-90%, 90-100%]
![Page 26: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/26.jpg)
Rank of saved/viewed docs
![Page 27: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )](https://reader035.vdocuments.us/reader035/viewer/2022062323/568158d6550346895dc61ef0/html5/thumbnails/27.jpg)
Number of relevant found