Search results need to be diverse
Mark Sanderson, University of Sheffield
Aim
• Tell you about our test collection work in Sheffield
• How we’ve been having fun building test collections
04/18/23
3
Organising this is hard
• TREC
  • Donna, Ellen
• CLEF
  • Carol
• NTCIR
  • Noriko
• Make sure you enjoy it
ImageCLEF
• Cross language image retrieval
• Running for 6 years
• Photo
• Medical
• And other tasks
• Imageclef.org
How do we do it?
• Organise and conduct research
• ImageCLEFPhoto 2008
  • Study diversity in search results
• Diversity?
Operational search engine
• Ambiguous queries
  • What is correct interpretation?
• Don’t know
  • Serve as diverse a range as possible
Diversity is studied
• Carbonell, J. and Goldstein, J. (1998) The use of MMR, diversity-based reranking for reordering documents and producing summaries. In ACM SIGIR, 335-336.
• Zhai, C. (2002) Risk Minimization and Language Modeling in Text Retrieval, PhD thesis, Carnegie Mellon University.
• Chen, H. and Karger, D. R. (2006) Less is more: probabilistic models for retrieving fewer relevant documents. In ACM SIGIR, 429-436.
Cluster hypothesis
• “closely associated documents tend to be relevant to the same requests”
  • Van Rijsbergen (1979)
Most test collections
• Focussed topic
• Relevance judgments
  • Who says what is relevant?
    • (almost always) one person
  • Consideration of interpretations
    • Little or none
• Gap between test and operation
Few test collections
• Hersh, W. R. and Over, P. (1999) TREC-8 Interactive Track Report. TREC-8
• Over, P. (1997) TREC-5 Interactive Track Report. TREC-5, 29-56
• Clarke, C. L., Kolla, M., Cormack, G. V., Vechtomova, O., Ashkan, A., Büttcher, S., and MacKinnon, I. (2008) Novelty and diversity in information retrieval evaluation. In ACM SIGIR.
Study diversity
• What sorts of diversity are there?
  • Ambiguous query words
• How often is it a feature of search?
  • How often are queries ambiguous?
• How can we add it into test collections?
Extent of diversity?
• “Ambiguous queries: test collections need more sense”, SIGIR 2008
• How do you define ambiguity?
  • Wikipedia
  • WordNet
Wikipedia stats
• enwiki-20071018-pages-articles.xml
  • (12.7Gb)
• Disambiguation pages easy to spot
  • “_(disambiguation)” in title
    • Example: Chicago
  • “{{disambig}}” template
    • Example: George_bush
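The two markers named on this slide can be checked with a few lines of code. The following is a minimal sketch only: it assumes each page is already available as a (title, wikitext) pair, and the function name and the sample page bodies are hypothetical, not taken from the talk.

```python
import re

# Both markers come from the slide; how pages are read from the dump is
# outside this sketch and assumed to be handled elsewhere.
DISAMBIG_TITLE = re.compile(r"_\(disambiguation\)$")
DISAMBIG_TEMPLATE = re.compile(r"\{\{\s*disambig\s*\}\}", re.IGNORECASE)


def is_disambiguation_page(title: str, wikitext: str) -> bool:
    """Return True if the title suffix or the template marks a disambiguation page."""
    # Wikipedia titles use underscores in place of spaces.
    normalised = title.strip().replace(" ", "_")
    return bool(
        DISAMBIG_TITLE.search(normalised) or DISAMBIG_TEMPLATE.search(wikitext)
    )


# Hypothetical page bodies, for illustration only.
by_title = is_disambiguation_page("Chicago (disambiguation)", "Chicago may refer to ...")
by_template = is_disambiguation_page("George_bush", "George Bush may refer to ... {{disambig}}")
neither = is_disambiguation_page("Sheffield", "Sheffield is a city in England.")
```

Checking both markers matters because, as the slide's two examples suggest, some disambiguation pages carry the suffix in the title while others are only identifiable by the template in the body.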
Query logs
Log   Unique queries (all)   Most frequent (fr)   Year(s) gathered
Web   1,000,000              8,719                2006
PA    507,914                14,541               2006-7
Fraction of ambiguous
                 1       2       3
Name             Wi      WN      WN+Wi
Web   freq       7.6%    4.0%    10.0%
      all        2.5%    0.8%    3.0%
PA    freq       10.5%   6.4%    14.7%
      all        2.1%    0.8%    2.7%
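Fractions like those in the table can be computed as below. This is a rough sketch: it assumes a query counts as ambiguous when its normalised text exactly matches an entry in a set of ambiguous terms, and that the combined WN+Wi column is the union of the two sets. Neither rule is confirmed by the slides, and the queries and term sets are toy data.

```python
def fraction_ambiguous(queries, ambiguous_terms):
    """Fraction of queries whose normalised text exactly matches an ambiguous term."""
    if not queries:
        return 0.0
    # Assumed matching rule: case-insensitive exact match on the whole query.
    normalised_terms = {term.strip().lower() for term in ambiguous_terms}
    hits = sum(1 for query in queries if query.strip().lower() in normalised_terms)
    return hits / len(queries)


# Toy data, for illustration only (not from the Web or PA logs).
queries = ["chicago", "george bush", "cheap flights to rome", "bank"]
wikipedia_terms = {"Chicago", "George Bush"}
wordnet_terms = {"bank"}

wi = fraction_ambiguous(queries, wikipedia_terms)
wn = fraction_ambiguous(queries, wordnet_terms)
combined = fraction_ambiguous(queries, wikipedia_terms | wordnet_terms)
```

Running the same function once over the most frequent queries and once over all unique queries would give the "freq" and "all" rows for a log.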
Conclusions
• Ambiguity is a problem
• Ambiguity is present in query logs
  • Not just Web search
• Ambiguity present?
  • Need for IR systems to produce diverse results
ImageCLEFPhoto
• Build a test collection
  • Encourage the study of diversity
• Study how others deal with diversity
• Have some fun
Collection
• IAPR TC-12
• 20,000 travel photographs
• Text captions
• 60 existing topics
  • Used in two previous studies
• 39 used for diversity study
Types of diversity
• 22 geographical
  • “Churches in Brazil”
• 17 other
  • “Australian animals”
Relevance judgments
• Clustered existing qrels
• Multiple assessors
  • Good level of agreement on clusters
Evaluation
• Precision at 20
  • P(20)
  • Fraction of relevant in top 20
• Cluster recall at 20
  • CR(20)
  • Fraction of different clusters in top 20
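The two measures can be sketched as follows. This assumes the clustered qrels for a topic are available as a mapping from each relevant document to its cluster, and it reads CR(20) as the number of distinct clusters seen in the top 20 divided by the number of clusters for the topic. The function names and the toy ranking are illustrative, not from the track.

```python
def precision_at_k(ranking, relevant, k=20):
    """P(k): fraction of the top k positions holding a relevant document."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def cluster_recall_at_k(ranking, doc_clusters, k=20):
    """CR(k): fraction of the topic's clusters represented in the top k."""
    all_clusters = set(doc_clusters.values())
    if not all_clusters:
        return 0.0
    # Only relevant documents carry a cluster label; others are skipped.
    seen = {doc_clusters[doc] for doc in ranking[:k] if doc in doc_clusters}
    return len(seen) / len(all_clusters)


# Toy topic with three clusters (a, b, c); k is shortened to 4 for readability.
doc_clusters = {"d1": "a", "d2": "a", "d4": "c", "d5": "b"}
ranking = ["d1", "d2", "d3", "d4"]

p4 = precision_at_k(ranking, set(doc_clusters), k=4)    # 3 of 4 are relevant
cr4 = cluster_recall_at_k(ranking, doc_clusters, k=4)   # clusters a and c out of 3
```

In the toy ranking, d1 and d2 are both relevant but fall in the same cluster, so the second adds to P(k) without adding to CR(k). That is the gap between the two measures that a diversity study depends on.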
Compare with past years
• Same 39 topics used in 2006, 2007
  • But without clustering
• Compare cluster recall on past runs
  • Based on identical P(20)
• Cluster recall increased
  • Substantially
  • Significantly
Meta-analysis
• This was fun
• We experimented on participants’ outputs
• Not by design
  • Lucky accident
Not first to think of this
• Buckley and Voorhees
  • SIGIR 2000, 2002
• Use submitted runs to generate new research
Conduct user experiment
• Do users prefer diversity?
• Experiment
  • Build a system to do this
  • Show users
    • Your system
    • Baseline system
  • Measure users
Why bother…
• …when others have done the work for you
• Pair up randomly sampled runs
  • High CR(20)
  • Low CR(20)
• Show to users
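One way the pairing could be set up is sketched below. The slides do not say how "high" and "low" were defined or how pairs were presented, so this sketch makes two assumptions: runs are split at the median CR(20), and the order within each pair is shuffled so that position does not reveal which run is more diverse. The run names and scores are made up.

```python
import random


def pair_runs(cr_by_run, n_pairs, seed=0):
    """Sample (run, run) pairs, one from the high-CR half and one from the low-CR half."""
    ranked = sorted(cr_by_run, key=cr_by_run.get)  # ascending CR(20)
    half = len(ranked) // 2
    if half == 0:
        raise ValueError("need at least two runs to form a pair")
    low, high = ranked[:half], ranked[-half:]

    rng = random.Random(seed)  # seeded so the sample is reproducible
    pairs = []
    for _ in range(n_pairs):
        pair = [rng.choice(high), rng.choice(low)]
        rng.shuffle(pair)  # assumed: hide which side is the diverse run
        pairs.append(tuple(pair))
    return pairs


# Hypothetical CR(20) scores for four submitted runs.
cr_by_run = {"run_a": 0.2, "run_b": 0.9, "run_c": 0.4, "run_d": 0.7}
pairs = pair_runs(cr_by_run, n_pairs=5)
```

Each sampled pair then holds one run from the upper half and one from the lower half, ready to be shown side by side.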
User preferences
• 54.6% more diversified;
• 19.7% less diversified;
• 17.4% both were equal;
• 8.3% preferred neither.
Conclusions
• Diversity appears to be important
• Systems don’t do diversity by default
• Users prefer diverse results
• Test collections don’t support diversity
  • But can be adapted