
Search Results Need to be Diverse

Mark Sanderson, University of Sheffield

How to have fun while running an evaluation campaign

Aim

• Tell you about our test collection work in Sheffield

• How we’ve been having fun building test collections


Organising this is hard

• TREC
  • Donna, Ellen

• CLEF
  • Carol

• NTCIR
  • Noriko

• Make sure you enjoy it


ImageCLEF

• Cross-language image retrieval

• Running for 6 years

• Photo

• Medical

• And other tasks

• imageclef.org


How do we do it?

• Organise and conduct research

• ImageCLEFPhoto 2008
  • Study diversity in search results

• Diversity?


SIGIR


ACL


Mark Sanderson


Cranfield model


Operational search engine

• Ambiguous queries
  • What is the correct interpretation?

• Don't know
  • Serve as diverse a range of results as possible


Diversity is studied

• Carbonell, J. and Goldstein, J. (1998) The use of MMR, diversity-based reranking for reordering documents and producing summaries. In ACM SIGIR, 335-336.

• Zhai, C. (2002) Risk Minimization and Language Modeling in Text Retrieval, PhD thesis, Carnegie Mellon University.

• Chen, H. and Karger, D. R. (2006) Less is more: probabilistic models for retrieving fewer relevant documents. In ACM SIGIR, 429-436.
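The first of these, MMR, is the classic diversification method, and a minimal sketch of its re-ranking loop may help make the idea concrete. This is an illustration in the spirit of Carbonell and Goldstein (1998), not their implementation: the similarity functions and the lambda value are caller-supplied assumptions.

```python
# Minimal sketch of MMR (Maximal Marginal Relevance) re-ranking,
# after Carbonell & Goldstein (1998). sim_query and sim_doc are
# assumed, caller-supplied similarity functions; lam trades off
# relevance to the query against novelty relative to picks so far.
def mmr_rerank(candidates, sim_query, sim_doc, lam=0.7, k=20):
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr_score(doc):
            # Redundancy: similarity to the most similar already-selected doc
            redundancy = max((sim_doc(doc, s) for s in selected), default=0.0)
            return lam * sim_query(doc) - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected
```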

Cluster hypothesis

• "Closely associated documents tend to be relevant to the same requests"
  • van Rijsbergen (1979)


Most test collections

• Focussed topic

• Relevance judgments
  • Who says what is relevant?
  • (Almost always) one person

• Consideration of interpretations
  • Little or none

• Gap between test collections and operational search

Few test collections

• Hersh, W. R. and Over, P. (1999) TREC-8 Interactive Track Report. TREC-8.

• Over, P. (1997) TREC-5 Interactive Track Report. TREC-5, 29-56.

• Clarke, C. L., Kolla, M., Cormack, G. V., Vechtomova, O., Ashkan, A., Büttcher, S., and MacKinnon, I. (2008) Novelty and diversity in information retrieval evaluation. In ACM SIGIR.


Study diversity

• What sorts of diversity are there?
  • Ambiguous query words

• How often is it a feature of search?
  • How often are queries ambiguous?

• How can we add it into test collections?


Extent of diversity?

• “Ambiguous queries: test collections need more sense”, SIGIR 2008

• How do you define ambiguity?
  • Wikipedia
  • WordNet


Disambiguation page


Wikipedia stats

• enwiki-20071018-pages-articles.xml
  • (12.7 GB)

• Disambiguation pages are easy to spot
  • "_(disambiguation)" in the title, e.g. Chicago
  • "{{disambig}}" template, e.g. George_bush

Conventional source

• Downloaded WordNet v3.0

• 88K words
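As a rough sketch of how the two ambiguity tests could be automated: the NLTK WordNet interface stands in here for the downloaded WordNet v3.0, and the Wikipedia test is shown as the kind of simple string check one could run over the enwiki dump. The function names and the NLTK dependency are illustrative assumptions, not details from the original study.

```python
# Sketch of the two ambiguity tests from the preceding slides.
from nltk.corpus import wordnet as wn

def ambiguous_in_wordnet(term: str) -> bool:
    """A term counts as ambiguous if WordNet lists more than one sense."""
    return len(wn.synsets(term)) > 1

def is_disambiguation_page(title: str, page_text: str) -> bool:
    """Wikipedia test: '_(disambiguation)' in the title (e.g. Chicago)
    or a {{disambig}} template in the page text (e.g. George_bush)."""
    return "_(disambiguation)" in title or "{{disambig}}" in page_text
```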


Query logs

Log | Unique queries (all) | Most frequent (fr) | Year(s) gathered
Web | 1,000,000 | 8,719 | 2006
PA  | 507,914 | 14,541 | 2006-7

Fraction of ambiguous queries

Name     | Wi    | WN   | WN+Wi
Web freq | 7.6%  | 4.0% | 10.0%
Web all  | 2.5%  | 0.8% | 3.0%
PA freq  | 10.5% | 6.4% | 14.7%
PA all   | 2.1%  | 0.8% | 2.7%

Conclusions

• Ambiguity is a problem

• Ambiguity is present in query logs
  • Not just Web search

• Ambiguity present?
  • Need for IR systems to produce diverse results


Test collections

• Don’t test for diversity

• Do search systems deal with it?


ImageCLEFPhoto

• Build a test collection
  • Encourage the study of diversity
  • Study how others deal with diversity

• Have some fun


Collection

• IAPR TC-12

• 20,000 travel photographs

• Text captions

• 60 existing topics
  • Used in two previous studies

• 39 used for diversity study


Diversity needs in a topic

• “Images of typical Australian animals”


Types of diversity

• 22 geographical
  • "Churches in Brazil"

• 17 other
  • "Australian animals"


Relevance judgments

• Clustered existing qrels

• Multiple assessors
  • Good level of agreement on clusters


Evaluation

• Precision at 20, P(20)
  • Fraction of relevant documents in the top 20

• Cluster recall at 20, CR(20)
  • Fraction of a topic's clusters represented in the top 20
  • (Both sketched in code below)
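A minimal sketch of the two measures, assuming each ranked result carries a relevance flag and, when relevant, a subtopic cluster id. The Result type and its field names are illustrative assumptions about how the qrels could be represented in code.

```python
# Sketch of the two ImageCLEFPhoto 2008 measures, P(20) and CR(20).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    relevant: bool
    cluster_id: Optional[str] = None  # subtopic cluster, set when relevant

def precision_at(results: list[Result], k: int = 20) -> float:
    """P(k): fraction of the top k results that are relevant."""
    return sum(r.relevant for r in results[:k]) / k

def cluster_recall_at(results: list[Result], n_clusters: int, k: int = 20) -> float:
    """CR(k): fraction of the topic's clusters represented in the top k."""
    seen = {r.cluster_id for r in results[:k]
            if r.relevant and r.cluster_id is not None}
    return len(seen) / n_clusters
```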


Track was popular

• 24 groups

• 200 runs in total


Submitted runs


[Figure: scatter plot of the submitted runs, P(20) against CR(20); both axes run from 0.0 to 0.6]

Compare with past years

• Same 39 topics used in 2006 and 2007
  • But without clustering

• Compare cluster recall on past runs
  • Based on identical P(20) (see the sketch below)

• Cluster recall increased
  • Substantially
  • Significantly
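One way the matched comparison could be run in code: pair an old run with a new run whenever the two score an identical P(20), then test whether CR(20) rose across the pairs. The pairing-on-identical-P(20) step is from the slides; the choice of a Wilcoxon signed-rank test is my assumption, since the slides only report a substantial and significant increase.

```python
# Sketch of comparing cluster recall across years on P(20)-matched runs.
from scipy.stats import wilcoxon

def compare_cluster_recall(old_runs, new_runs):
    """old_runs, new_runs: lists of (p20, cr20) tuples, one per run."""
    old_cr, new_cr = [], []
    remaining = list(new_runs)
    for p_old, cr_old in old_runs:
        # Match this old run to a new run with identical P(20), if any
        match = next(((p, cr) for p, cr in remaining if p == p_old), None)
        if match is not None:
            remaining.remove(match)
            old_cr.append(cr_old)
            new_cr.append(match[1])
    mean_gain = sum(new_cr) / len(new_cr) - sum(old_cr) / len(old_cr)
    stat, p_value = wilcoxon(new_cr, old_cr)  # assumed test choice
    return mean_gain, p_value
```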


Meta-analysis

• This was fun

• We experimented on participants' outputs

• Not by design
  • A lucky accident


Not the first to think of this

• Buckley and Voorhees
  • SIGIR 2000, 2002

• Use submitted runs to generate new research


Conduct user experiment

• Do users prefer diversity?

• Experiment
  • Build a system to do this
  • Show users
    • Your system
    • A baseline system
  • Measure users' preferences


Why bother…

• …when others have done the work for you

• Pair up randomly sampled runs
  • High CR(20)
  • Low CR(20)

• Show to users


Animals swimming


Numbers

• 25 topics

• 31 users

• 775 result pairs compared


User preferences

• 54.6% preferred the more diversified results

• 19.7% preferred the less diversified results

• 17.4% thought both were equal

• 8.3% preferred neither


Conclusions

• Diversity appears to be important

• Systems don't do diversity by default

• Users prefer diverse results

• Test collections don't support diversity
  • But they can be adapted


And…

• Organising evaluation campaigns is rewarding
  • And it can generate novel research
