
Page 1

Examining the limits of crowdsourcing for relevance assessment

Paul Clough (p.d.clough@sheffield.ac.uk)
Information School, University of Sheffield

Forum for Information Retrieval Evaluation 2014, Bangalore, 5-7 December 2014

Page 2

Page 3

• Information access
• Computational drug discovery
• Digital healthcare
• Information literacy
• Digital libraries
• Search analytics
• Research data management
• User interfaces and interaction
• Governance and open access
• Social media analysis
• Evaluation of information systems
• Data mining and visualisation
• Learning technologies
• Collaborative knowledge sharing

• Diverse research profile
• Research questions around
  – Organisations
  – Society
  – Technical

Now offering an MSc Data Science: http://www.sheffield.ac.uk/is/pgt/courses/data_science

Page 4

About the study

Clough, P., Sanderson, M., Tang, J., Gollins, T. and Warner, A. (2013) Examining the limits of crowdsourcing for relevance assessment, IEEE Internet Computing, 32-38.

Improving Information Finding at The National Archives

“the outputs of the work were provided in such a focused and timely way that they have been able to dovetail with our system development activities this year and inform critical decisions in designing our new resource discovery system to the benefit of the organisation and ultimately the public that use it” Tim Gollins, The National Archives, Head of Digital Preservation and Resource Discovery


Page 5

Outline

• Background
• Crowdsourcing
• Methodology
• Results
• Discussion
• Summary

Page 6

The National Archives

• The National Archives (TNA) is the UK Government’s official document repository
  – http://www.nationalarchives.gov.uk/

• Store and maintain records spanning over a thousand years in both physical and digital form

• Provide access to their extensive range of datasets via search functionalities (“search the archives”)
  – Electronic versions of paper documents
  – Catalogues describing the contents of TNA
  – Published information (e.g. London Gazette)
  – Other datasets (e.g. Your Archives wiki)

Page 7

Page 8

What TNA wanted from us

• The National Archives approached us to help with evaluating their search tools, including “search the archives”

• TNA were specifically interested in building re-usable test collections and measuring system effectiveness
  – To enable the tuning of their search algorithms/products
  – To allow comparison of different versions/evolutions of the catalogue search

• There were a number of constraints to the work, the main one being limited resources
  – We were therefore asked to think about cost-effective ways of creating evaluation resources

Page 9

Evaluating IR systems

• IR systems are ultimately used by people, for some purpose, operating in an environment

• What makes an IR system successful?
  – Whether it retrieves ‘relevant’ documents
  – How quickly it returns results
  – How well it supports user interaction
  – Whether the user is satisfied with the results
  – How easily users can use the system
  – Whether the system helps users carry out tasks
  – Whether the system impacts on the wider environment

Clough, P. and Goodale, P. (2013) Selecting Success Criteria: Experiences with an Academic Library Catalogue, In Proceedings of 4th Conference on Evaluation of Information Access Systems (CLEF 2013), 59-70.

Page 10

Evaluating IR systems

• Measurement usually carried out in controlled laboratory experiments (with or without users)
  – Standardised benchmarks (test collections)
  – Lab-based experiments with users
  – Online using operational systems

• Effectiveness, efficiency, usability and cost are related
  – e.g., if we want a particular level of effectiveness and efficiency, this will determine the cost of the system configuration

• Evaluation typically comparative (Systems A vs. B)
• Most common evaluation: retrieval effectiveness
• Evaluate during development or at the end

Page 11

Evaluation “landscape” (Kelly, 2009)

[Diagram: an evaluation spectrum ranging from “Lab-based / Simulated users / Controlled variables” to “In situ (living lab) / Real users, needs and situations / Uncontrolled variables”, with “Inform” and “Predict” linking the two ends.]

Page 12

IR test collections

• Test collections provide re-usable resources to evaluate IR systems in a controlled lab setting
  – Collection of documents
  – Set of representative queries (topics)
  – Set of relevance judgments for each topic
  – Evaluation measures (system performance)

• Test collection + measures provides a simulation of a user in an operational setting (if designed carefully)
  – Do results obtained with test collections predict user task success or performance / satisfaction with results?
  – But what about beyond the query-response paradigm?
  – How do you integrate contextual and situational factors?

‘Cranfield style’ test collection
Comparative system evaluation

Clough, P. and Sanderson, M. (2013) Evaluating the performance of information retrieval systems using test collections, Information Research, Volume 18(2), paper 582

Page 13

[Diagram: a search engine and a set of judges are combined through a function f( ) to produce evaluation results. From the tutorial on “Low-Cost Evaluation, Reliability and Reusability”, RuSSIR 2011, Evangelos Kanoulas.]

Page 14

IR test collections – practical issues

• Gathering a collection of documents
• Generating a suitable set of queries/topics
  – How do I obtain the queries/topics?
  – How many queries/topics do I need?
• Creating the relevance assessments
  – How do I gather the assessments?
  – Who should do the assessments?
  – How many assessments should be made?
  – What are the assessors expected to do?
  – What about finding missing relevant documents?
• Selecting a suitable evaluation measure
• These decisions will affect the quality of the benchmark and impact on the accuracy/usefulness of results

Page 15

Page 16

So in summary …

• Evaluation is critical for designing, developing and evolving effective IR systems

• Traditionally there has been a strong focus on measuring system effectiveness in controlled lab settings

• Cranfield methodology commonly used, where human judges manually assess documents for relevance to a range of queries

• However, gathering relevance assessments can be a real bottleneck, especially in real-life settings


Page 17

Crowdsourcing

• Crowdsourcing is the act of taking a job traditionally performed by a designated person and outsourcing it to an undefined, generally large, group of people in the form of an open call

http://www.wired.com/wired/archive/14.06/crowds.html

Page 18

• Crowdsourcing was recently made popular by the development of the Amazon Mechanical Turk (AMT) platform

• Artificial artificial intelligence
  – “For many tasks, humans outperform computers. So why not farm out computing tasks to people, not machines?”
    http://www.economist.com/node/7001738?story_id=7001738

• Mechanical Turk
  – Requester creates Human Intelligence Tasks (HITs)
  – Workers choose to complete HITs
  – Requesters assess results and pay workers
  – Currently > 200,000 workers from many countries
  – http://www.behind-the-enemy-lines.com/2010/03/new-demographics-of-mechanical-turk.html


Page 19

Example AMT task: tenthousandcents.com


Page 20

Generating relevance assessments

• Relevance assessment is time-consuming and causes a bottleneck in IR evaluation
  – Often requires the input of domain experts
  – Pooling is commonly used to form sets of documents for assessors to judge (a sketch of pool construction follows below)
  – Coverage of pools (and depth)

• More efficient approaches to judge relevance?
  – Move-to-Front (MTF) pooling
  – Interactive Search and Judge
  – Use of sampling techniques
  – Use of implicit judgments (e.g. from log data)
  – Crowdsourcing
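As a rough illustration of how a pool is formed (a minimal sketch of the general technique, not TNA's or TREC's exact procedure), the pool for a topic is simply the union of the top-k documents returned by each participating system, and only pooled documents are shown to assessors:

```python
def build_pool(runs, k=10):
    """Union of the top-k documents from each system's ranked run for one topic.

    `runs` maps a system name to its ranked list of document ids.
    Only documents in the returned pool are judged by assessors.
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

# Example: two systems, pool depth 10
runs = {"systemA": ["d3", "d7", "d1"], "systemB": ["d7", "d9", "d2"]}
print(build_pool(runs, k=10))  # {'d1', 'd2', 'd3', 'd7', 'd9'}
```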

Page 21

Crowdsourced assessments

• Crowdsourcing has been used in various natural language processing tasks (Snow et al., 2008), for example
  – Comparing the output of MT systems
  – Assessing semantic similarity of words

• Crowdsourcing has also been shown to be feasible for assessing the relevance of IR system results, for example
  – Alonso & Mizzaro (2009) compared crowdsourced judgments (using AMT) with judgments of TREC assessors
  – They showed that crowdsourcing was a reliable way of providing relevance assessments, although creating a TREC-like experiment suitable for crowdsourcing required careful design and execution

Page 22

Effects of domain expertise

• Bailey et al. (2008) compared the relevance assessments of judges who were subject experts and those who were not
  – Found that different assessments resulted, which had an effect on system effectiveness
  – Distinctions between top-performing systems were harder to discern based on assessments from the judges who were not subject experts

• Kinney et al. (2008) also investigated domain expertise and found that non-specialist assessors made significant errors
  – Compared to experts, they disagreed on the underlying meaning of queries and subsequently there were effects on calculations of system effectiveness
  – Found that the rating accuracy of generalists was improved if domain experts provided descriptions of what the user, issuing a query, was seeking

Page 23

So in summary …

• Crowdsourcing has successfully been used in various natural language processing tasks
• Crowdsourcing has been shown to be feasible for relevance assessment
• However, previous studies also showed that domain expertise can have an effect on judgments
• A topic that is less examined, however, is the limitations of a crowdsourcing approach, especially within specialised domains such as The National Archives

Question: will crowdsourced judgments for queries and documents covering a specialist domain be as accurate as, and correlate with, the judgments of a domain expert?

Page 24

Methodology

• Our study
  – Compare relevance assessments from crowdsourced workers with those from a domain expert
  – Measure impact on system effectiveness scores and relative rankings of two search engines used to provide search across the archives at TNA (System A and System B)

• Domain expert
  – Worked with subject experts across the organisation to identify answers to search queries
  – Assessor held degrees in History (BA and MA)
  – Head of Systems Development and Search at TNA for approximately 3 years, examining how people searched the TNA website
  – Assessor worked at TNA for around 6.5 years

Page 25

Experimental setup

• 48 queries selected by a member of the search quality team at TNA from analysing search logs and identifying frequent queries
  – Half “navigational” and half “informational”

• The TNA team member examined retrieved documents and from this wrote a 1-2 sentence description of the likely information need behind each query
  – This was used to help assessors better understand the user’s intent, in line with the recommendations of Kinney et al. (2008)

• The 48 queries were issued to two search engines (System A and System B) and the ten highest ranked documents retrieved and judged

• Precision at rank 10 (P@10) and Discounted Cumulative Gain measured at rank 10 (DCG@10) used to measure performance (see the sketch below)

• Workers’ scores aggregated by using average DCG and P@10 scores
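A minimal sketch of the two measures and the averaging of worker scores (the exact DCG discount used in the study is not stated on the slides, so the common log2(rank+1) form below is an assumption):

```python
import math

def precision_at_k(judgments, k=10):
    """P@k: fraction of the top-k documents judged relevant (grade > 0)."""
    top = judgments[:k]
    return sum(1 for rel in top if rel > 0) / k

def dcg_at_k(judgments, k=10):
    """DCG@k with graded relevance (0, 1, 2) and a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(judgments[:k], start=1))

# Per-query scores from several workers are averaged before comparison
worker_judgments = [[2, 1, 0, 2, 0, 0, 1, 0, 0, 0],
                    [2, 2, 0, 1, 0, 0, 0, 0, 0, 0]]
avg_p10 = sum(precision_at_k(j) for j in worker_judgments) / len(worker_judgments)
avg_dcg = sum(dcg_at_k(j) for j in worker_judgments) / len(worker_judgments)
print(round(avg_p10, 2), round(avg_dcg, 2))
```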

Page 26

Queries used in study

Informational | Navigational
"John Parmenter", lincolnshire regiment | 1901 census, medal rolls
arklow, Lockton | 1911 census, medals
assi 45, merchant navy | army records, military
aylesford kent, monks eleigh births | military records
Bradshawe, NAVY 1953 | boer war, naval records
C15, palestine | cabinet papers, prison records
churton, pannal | divorce, royal navy
Clapham Common Rail 264 | divorce records, ufo
david morgan songer | domesday, war diaries
fawcett, togoland | maps, wills
John William Baker, victorious | ship logs, medal cards ww1
knaresborough forest, wicklow | medal index cards ww2

Page 27

Example queries with descriptions

“John Parmenter” (informational)

Any entries among The National Archives data which contain the term ‘John Parmenter’. User is probably searching for sources which are useful for Family History, in particular wills or military records, especially medal rolls.

1901 census (navigational)

Any result that allows users to find out information about the 1901 Census (or search its data). Most likely preferred destination is a link to http://www.1901censusonline.com/ - an external site which actually contains the Census data.


Page 28

Design of Mechanical Turk experiment

• Each job performed by the crowdsource workers (known as a HIT, Human Intelligence Task) consisted of being shown (see the sketch after this list):
  – Query
  – Description of the query intent
  – 10 retrieved documents to be judged for relevance (output by either system A or B)
  – Four post-task questions to answer (using 5-point Likert scales)

• Graded relevance assessments were used to assess relevance
  – 0 = not relevant
  – 1 = partially relevant
  – 2 = highly relevant

• Workers were encouraged to judge multiple queries
• Ethics approval obtained!
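One hypothetical way to represent the unit of work described in the list above (the field names are illustrative, not the actual AMT template used in the study):

```python
from dataclasses import dataclass, field

# Graded relevance scale used in the study
NOT_RELEVANT, PARTIALLY_RELEVANT, HIGHLY_RELEVANT = 0, 1, 2

@dataclass
class HIT:
    """One Human Intelligence Task as presented to a worker (hypothetical structure)."""
    query: str
    intent_description: str                              # 1-2 sentence description of the likely need
    documents: list                                      # the 10 results from system A or system B
    judgments: dict = field(default_factory=dict)        # doc id -> grade in {0, 1, 2}
    likert_answers: list = field(default_factory=list)   # four post-task questions, each 1-5
```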

Page 29

How it appeared on Mechanical Turk


Page 30

Design of Mechanical Turk experiment

• We aimed to gather 10 sets of judgments per query-system combination, resulting in a total of 960 HITs (48 x 2 x 10)
  – Workers were offered 4¢ per completed HIT
  – Total cost of the experiment was $43.20 (including admin fees; a budgeting sketch follows below)
  – Experiment was run over two weeks

• Data collected from 73 unique workers who produced 924 HITs (96.3% of the total available)
  – The ten most active workers completed approximately 80% of the HITs

• No gold standard data was used in the experiments; data was manually checked for noise, which resulted in 91 HITs being eliminated from one worker
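A back-of-the-envelope budgeting sketch; the 12.5% fee rate below is an assumption chosen only because it is consistent with the reported $43.20 total, not a figure taken from the study:

```python
n_queries, n_systems, judgments_per_pair = 48, 2, 10
hits = n_queries * n_systems * judgments_per_pair     # 960 HITs planned
reward_per_hit = 0.04                                 # 4 cents per HIT
assumed_fee_rate = 0.125                              # assumed platform fee, not from the slides
total_cost = hits * reward_per_hit * (1 + assumed_fee_rate)
print(hits, round(total_cost, 2))                     # 960, 43.2
```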

Page 31

Page 32

Questionnaire responses

Queries | Q1 – difficulty (1=V. difficult; 5=V. easy) | Q2 – familiarity (1=V. unfamiliar; 5=V. familiar) | Q3 – confidence (1=Not at all confident; 5=V. confident) | Q4 – satisfaction (1=V. unsatisfied; 5=V. satisfied)

Expert – All | 4.36 | 4.34 | 4.25 | 3.25
Expert – Informational | 4.10 | 4.13 | 4.00 | 3.04
Expert – Navigational | 4.63 | 4.56 | 4.50 | 3.46
Expert – System A | 4.54 | 4.44 | 4.44 | 3.90
Expert – System B | 4.19 | 4.25 | 4.06 | 2.60

Crowd-sourced workers – All | 3.47 | 3.54 | 4.12 | 4.13
Crowd-sourced workers – Informational | 3.48 | 3.43 | 4.04 | 4.05
Crowd-sourced workers – Navigational | 3.47 | 3.65 | 4.20 | 4.21
Crowd-sourced workers – System A | 3.42 | 3.57 | 4.18 | 4.18
Crowd-sourced workers – System B | 3.52 | 3.51 | 4.05 | 4.08

Page 33

Comparing effectiveness of System A vs. System B

Judgments | P@10 (All) | DCG (All, N=48) | DCG (Informational, N=24) | DCG (Navigational, N=24)
Expert – System A | 0.81 | 7.43 | 6.59 | 8.26
Expert – System B | 0.51 | 4.39 | 4.66 | 4.11
Expert – Average | 0.66 | 5.91 | 5.63 | 6.19
Crowd-sourced workers – System A | 0.77 | 6.40 | 5.60 | 7.21
Crowd-sourced workers – System B | 0.63 | 5.32 | 4.93 | 5.71
Crowd-sourced workers – Average | 0.70 | 5.86 | 5.27 | 6.46

With both sets of judgments, System A is measured to be statistically significantly better than System B (using a paired t-test) and, although the absolute scores differ, the relative ranking of the systems remains stable.
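A minimal sketch of such a paired comparison, using hypothetical per-query scores rather than the study's data:

```python
from scipy import stats

# Hypothetical per-query DCG@10 scores for the same queries under each system
# (in the study there were 48 paired queries)
system_a = [7.1, 8.3, 6.0, 9.2, 5.4, 7.8]
system_b = [4.2, 5.1, 3.9, 6.0, 4.4, 4.7]

t_stat, p_value = stats.ttest_rel(system_a, system_b)  # paired t-test across queries
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```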


Page 34

DCG results across all queries

The crowdsource workers rate system B as far better than the expert does, causing the large differences in absolute scores: crowdsource workers appear unable to detect poor performance (see also the questionnaire results).

[Charts: DCG per query, shown separately for informational and navigational queries]


Page 35

Comparing correlations between assessors

• Judgments between the crowdsourced workers and the expert are more interchangeable for system B than A, despite resulting differences in absolute scores

Queries | DCG | P@10
System A – All (N=48) | 0.492** | 0.285*
System A – Informational | 0.323 | -0.029
System A – Navigational | 0.485* | 0.467*
System B – All (N=48) | 0.601** | 0.595**
System B – Informational | 0.563** | 0.523**
System B – Navigational | 0.772** | 0.786**
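A sketch of how per-query correlations like these can be computed; the slides do not state which coefficient was used, so Pearson is an assumption here, with Spearman as the rank-based alternative:

```python
from scipy import stats

# Hypothetical per-query DCG scores from the expert and the (averaged) workers
expert_dcg = [7.1, 4.0, 8.3, 2.5, 6.2, 5.0]
worker_dcg = [6.4, 4.4, 7.0, 3.1, 5.8, 5.5]

r, p = stats.pearsonr(expert_dcg, worker_dcg)        # assumed coefficient
rho, p_s = stats.spearmanr(expert_dcg, worker_dcg)   # rank-based alternative
print(f"Pearson r = {r:.3f} (p = {p:.3f}), Spearman rho = {rho:.3f}")
```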

Page 36

Comparing judgments I

• Navigational queries correlate better than informational

• The lower correlation between expert and crowdsourced workers for system A, particularly for informational queries, suggests the results of a higher-quality search engine are more difficult to assess using crowdsourcing

[Scatter plots: Expert DCG score against MTurker DCG score per query (informational and navigational), for System A (the better system) and for System B]

Page 37

Comparing judgments II

• Multiple crowdsourced judgments aggregated to compute average score

• Compute measure of inter-rater agreement based on standard deviation across P@10 scores for all crowdsource workers

• Strong positive correlation indicating that the greater the variation between the crowdsourced judgments, the greater the difference between the average crowdsourced worker score and the expert

• Impact: could use this to predict how well workers will agree with expert judge
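A sketch of that computation on hypothetical per-query P@10 scores (the correlation coefficient is again an assumption, since the slides do not name one):

```python
import statistics
from scipy import stats

# Hypothetical per-query P@10 scores: the expert plus each worker who judged the query
expert_p10 = [0.8, 0.5, 0.9, 0.4, 0.7]
worker_p10 = [[0.7, 0.8, 0.6], [0.2, 0.6, 0.4], [0.9, 0.8, 0.9],
              [0.1, 0.5, 0.3], [0.6, 0.8, 0.7]]

spread = [statistics.stdev(w) for w in worker_p10]                           # worker disagreement
gap = [abs(e - statistics.mean(w)) for e, w in zip(expert_p10, worker_p10)]  # distance from expert

r, p = stats.pearsonr(spread, gap)
print(f"correlation between worker disagreement and distance from expert: r = {r:.2f}")
```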


Page 38

Difference between P@10 scores for System B (worse system)

[Scatter plot: difference between Expert and Turker P@10 scores against the standard deviation of Turker P@10 scores, for informational and navigational queries]

Page 39

Discussion I

• Many organisations need reliable and repeatable methodologies for evaluating search services
  – Crowdsourcing potentially lowers the cost and offers an efficient way of collecting relevance assessments

• Cost of relevance assessments for our study (hourly rates are worked out below)
  – Crowdsource workers: cost $43.00 for 45 hrs 13 mins of assessor effort purchased on AMT
  – Expert assessor: cost $106.02 for 3 hrs 5 mins of work

• Our results suggest that crowdsourced assessments and those generated by an expert produce similar results when ranking systems A and B (A was better)
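Converting the reported figures to approximate hourly rates (simple arithmetic on the numbers above, rounded):

```python
crowd_cost, crowd_hours = 43.00, 45 + 13 / 60      # $43.00 for 45 hrs 13 mins
expert_cost, expert_hours = 106.02, 3 + 5 / 60     # $106.02 for 3 hrs 5 mins

print(f"crowd:  ${crowd_cost / crowd_hours:.2f} per hour of assessor effort")    # ≈ $0.95
print(f"expert: ${expert_cost / expert_hours:.2f} per hour of assessor effort")  # ≈ $34.38
```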


Page 40

Discussion II

• However, limits were found to using crowdsource workers for assessing relevance within the TNA setting
  – Correlations between the crowdsourced workers and the expert assessor are lower for certain kinds of queries
  – In our case informational queries, in which the assessor is judging a more subject-oriented type of search, seem to be more difficult for assessors on a system that performs better (system A)
  – Although absolute scores for system B were higher based on crowdsourced worker judgments than the expert's, it seems likely that results from a poorer-performing system were easier to judge

Page 41

Summary I

• Results indicate that crowdsource workers were good enough for the job (in hand) of correctly ranking the two search engines based on retrieval effectiveness

• Absolute scores differ but relative rankings remain stable across assessor groups (Cleverdon, 1970; Lesk & Salton, 1968; Voorhees, 1998)

• Crowdsourcing seems a viable method of gathering relevance assessments (Alonso & Mizzaro, 2008; Carvalho et al., 2011; Hosseini et al., 2012)

• But there are disagreements between assessors linked with domain expertise (Bailey et al., 2008; Kinney et al., 2008)
  – Informational vs. navigational queries
  – For the better-performing system more than the poorer one

Page 42

Summary II

• The National Archives are very interested in the results due to constraints on resources and the need to evaluate their IR system

• Crowdsourcing is a practical and efficient approach to gathering relevance assessments

• However, they are unlikely to use an unknown crowd, but rather known user groups with more expertise and domain knowledge
  – For example, volunteer groups and enthusiasts of the National Archives and their content

Page 43

Future work

• Extensive dataset which would provide the basis for further investigations
  – “Failure analysis” on specific queries and assessors (e.g. look for assessors whose judgments better match the expert)
  – Use current data as training data to identify incorrect judgments or poor assessors (e.g. filtering results by levels of expertise)
  – Experiment with different approaches to aggregating judgments from crowdsource workers (at present we used the average of system effectiveness scores; a voting approach, sketched after this list, is one alternative)

• Experiment with alternative user groups with varying levels of domain expertise
  – Maybe use crowdsourcing to judge some queries; an expert for others

• Experiment with larger and more varied sets of queries (esp. rarer queries)
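A minimal sketch of the voting alternative mentioned in the list above (a simple per-document majority vote; not something evaluated in the study, and the tie-breaking rule is an arbitrary choice):

```python
from collections import Counter

def majority_vote(grades):
    """Aggregate several workers' graded judgments (0/1/2) for one document.

    Ties are broken in favour of the lower grade; that choice is arbitrary.
    """
    counts = Counter(grades)
    best = max(counts.values())
    return min(g for g, c in counts.items() if c == best)

print(majority_vote([2, 2, 1, 0, 2]))  # 2
print(majority_vote([1, 0, 1, 0]))     # 0 (tie broken toward the lower grade)
```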

Page 44

Thanks for listening

Any questions?


Acknowledgements for funding and support