
Search Results Need to be Diverse

Mark Sanderson, University of Sheffield

How to have fun while running an evaluation campaign

Aim

• Tell you about our test collection work in Sheffield

• How we’ve been having fun building test collections


Organising this is hard

• TREC
  • Donna, Ellen

• CLEF
  • Carol

• NTCIR
  • Noriko

• Make sure you enjoy it


ImageCLEF

• Cross-language image retrieval

• Running for 6 years

• Photo

• Medical

• And other tasks

• imageclef.org


How do we do it?

• Organise and conduct research

• ImageCLEFPhoto 2008
  • Study diversity in search results

• Diversity?


SIGIR


ACL


Mark Sanderson


Cranfield model


Operational search engine

• Ambiguous queries
  • What is the correct interpretation?

• Don't know
  • Serve as diverse a range of results as possible


Diversity is studied

• Carbonell, J. and Goldstein, J. (1998) The use of MMR, diversity-based reranking for reordering documents and producing summaries. In ACM SIGIR, 335-336.

• Zhai, C. (2002) Risk Minimization and Language Modeling in Text Retrieval, PhD thesis, Carnegie Mellon University.

• Chen, H. and Karger, D. R. (2006) Less is more: probabilistic models for retrieving fewer relevant documents. In ACM SIGIR, 429-436.
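The first of these, MMR, is the classic diversification method, and a minimal sketch of its re-ranking loop may help make the idea concrete. This is an illustration in the spirit of Carbonell and Goldstein (1998), not their implementation: the similarity functions and the lambda value are caller-supplied assumptions.

```python
# Minimal sketch of MMR (Maximal Marginal Relevance) re-ranking,
# after Carbonell & Goldstein (1998). sim_query and sim_doc are
# assumed, caller-supplied similarity functions; lam trades off
# relevance to the query against novelty relative to picks so far.
def mmr_rerank(candidates, sim_query, sim_doc, lam=0.7, k=20):
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr_score(doc):
            # Redundancy: similarity to the most similar already-selected doc
            redundancy = max((sim_doc(doc, s) for s in selected), default=0.0)
            return lam * sim_query(doc) - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected
```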

Cluster hypothesis

• "Closely associated documents tend to be relevant to the same requests"
  • van Rijsbergen (1979)


Most test collections

• Focussed topic

• Relevance judgments
  • Who says what is relevant?
  • (Almost always) one person

• Consideration of interpretations
  • Little or none

• Gap between test collections and operational search

Few test collections

• Hersh, W. R. and Over, P. (1999) TREC-8 Interactive Track Report. TREC-8.

• Over, P. (1997) TREC-5 Interactive Track Report. TREC-5, 29-56.

• Clarke, C. L., Kolla, M., Cormack, G. V., Vechtomova, O., Ashkan, A., Büttcher, S., and MacKinnon, I. (2008) Novelty and diversity in information retrieval evaluation. In ACM SIGIR.


Study diversity

• What sorts of diversity are there?
  • Ambiguous query words

• How often is it a feature of search?
  • How often are queries ambiguous?

• How can we add it into test collections?


Extent of diversity?

• “Ambiguous queries: test collections need more sense”, SIGIR 2008

• How do you define ambiguity?
  • Wikipedia
  • WordNet


Disambiguation page


Wikipedia stats

• enwiki-20071018-pages-articles.xml
  • (12.7 GB)

• Disambiguation pages are easy to spot
  • "_(disambiguation)" in the title, e.g. Chicago
  • "{{disambig}}" template, e.g. George_bush

Conventional source

• Downloaded WordNet v3.0

• 88K words
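As a rough sketch of how the two ambiguity tests could be automated: the NLTK WordNet interface stands in here for the downloaded WordNet v3.0, and the Wikipedia test is shown as the kind of simple string check one could run over the enwiki dump. The function names and the NLTK dependency are illustrative assumptions, not details from the original study.

```python
# Sketch of the two ambiguity tests from the preceding slides.
from nltk.corpus import wordnet as wn

def ambiguous_in_wordnet(term: str) -> bool:
    """A term counts as ambiguous if WordNet lists more than one sense."""
    return len(wn.synsets(term)) > 1

def is_disambiguation_page(title: str, page_text: str) -> bool:
    """Wikipedia test: '_(disambiguation)' in the title (e.g. Chicago)
    or a {{disambig}} template in the page text (e.g. George_bush)."""
    return "_(disambiguation)" in title or "{{disambig}}" in page_text
```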


Query logs

Log | Unique queries (all) | Most frequent (fr) | Year(s) gathered
Web | 1,000,000 | 8,719 | 2006
PA  | 507,914 | 14,541 | 2006-7

Fraction of ambiguous queries

Name     | Wi    | WN   | WN+Wi
Web freq | 7.6%  | 4.0% | 10.0%
Web all  | 2.5%  | 0.8% | 3.0%
PA freq  | 10.5% | 6.4% | 14.7%
PA all   | 2.1%  | 0.8% | 2.7%

Conclusions

• Ambiguity is a problem

• Ambiguity is present in query logs
  • Not just Web search

• Ambiguity present?
  • Need for IR systems to produce diverse results


Test collections

• Don’t test for diversity

• Do search systems deal with it?


ImageCLEFPhoto

• Build a test collection
  • Encourage the study of diversity
  • Study how others deal with diversity

• Have some fun


Collection

• IAPR TC-12

• 20,000 travel photographs

• Text captions

• 60 existing topics
  • Used in two previous studies

• 39 used for diversity study


Diversity needs in a topic

• “Images of typical Australian animals”


Types of diversity

• 22 geographical
  • "Churches in Brazil"

• 17 other
  • "Australian animals"


Relevance judgments

• Clustered existing qrels

• Multiple assessors
  • Good level of agreement on clusters


Evaluation

• Precision at 20, P(20)
  • Fraction of relevant documents in the top 20

• Cluster recall at 20, CR(20)
  • Fraction of a topic's clusters represented in the top 20
  • (Both sketched in code below)
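A minimal sketch of the two measures, assuming each ranked result carries a relevance flag and, when relevant, a subtopic cluster id. The Result type and its field names are illustrative assumptions about how the qrels could be represented in code.

```python
# Sketch of the two ImageCLEFPhoto 2008 measures, P(20) and CR(20).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    relevant: bool
    cluster_id: Optional[str] = None  # subtopic cluster, set when relevant

def precision_at(results: list[Result], k: int = 20) -> float:
    """P(k): fraction of the top k results that are relevant."""
    return sum(r.relevant for r in results[:k]) / k

def cluster_recall_at(results: list[Result], n_clusters: int, k: int = 20) -> float:
    """CR(k): fraction of the topic's clusters represented in the top k."""
    seen = {r.cluster_id for r in results[:k]
            if r.relevant and r.cluster_id is not None}
    return len(seen) / n_clusters
```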


Track was popular

• 24 groups

• 200 runs in total


Submitted runs


[Figure: scatter plot of the submitted runs, P(20) against CR(20); both axes run from 0.0 to 0.6]

Compare with past years

• Same 39 topics used in 2006 and 2007
  • But without clustering

• Compare cluster recall on past runs
  • Based on identical P(20) (see the sketch below)

• Cluster recall increased
  • Substantially
  • Significantly
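One way the matched comparison could be run in code: pair an old run with a new run whenever the two score an identical P(20), then test whether CR(20) rose across the pairs. The pairing-on-identical-P(20) step is from the slides; the choice of a Wilcoxon signed-rank test is my assumption, since the slides only report a substantial and significant increase.

```python
# Sketch of comparing cluster recall across years on P(20)-matched runs.
from scipy.stats import wilcoxon

def compare_cluster_recall(old_runs, new_runs):
    """old_runs, new_runs: lists of (p20, cr20) tuples, one per run."""
    old_cr, new_cr = [], []
    remaining = list(new_runs)
    for p_old, cr_old in old_runs:
        # Match this old run to a new run with identical P(20), if any
        match = next(((p, cr) for p, cr in remaining if p == p_old), None)
        if match is not None:
            remaining.remove(match)
            old_cr.append(cr_old)
            new_cr.append(match[1])
    mean_gain = sum(new_cr) / len(new_cr) - sum(old_cr) / len(old_cr)
    stat, p_value = wilcoxon(new_cr, old_cr)  # assumed test choice
    return mean_gain, p_value
```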


Meta-analysis

• This was fun

• We experimented on participants' outputs

• Not by design
  • A lucky accident


Not the first to think of this

• Buckley and Voorhees
  • SIGIR 2000, 2002

• Use submitted runs to generate new research


Conduct user experiment

• Do users prefer diversity?

• Experiment
  • Build a system to do this
  • Show users
    • Your system
    • A baseline system
  • Measure users' preferences


Why bother…

• …when others have done the work for you

• Pair up randomly sampled runs
  • High CR(20)
  • Low CR(20)

• Show to users


Animals swimming


Numbers

• 25 topics

• 31 users

• 775 result pairs compared


User preferences

• 54.6% preferred the more diversified results

• 19.7% preferred the less diversified results

• 17.4% thought both were equal

• 8.3% preferred neither


Conclusions

• Diversity appears to be important

• Systems don't do diversity by default

• Users prefer diverse results

• Test collections don't support diversity
  • But they can be adapted


And…

• Organising evaluation campaigns is rewarding
  • And it can generate novel research
