
Page 1

Examining the limits of crowdsourcing for relevance assessment

Paul Clough (p.d.clough@sheffield.ac.uk)
Information School, University of Sheffield

Forum for Information Retrieval Evaluation 2014, Bangalore, 5-7 December 2014

Page 2

Page 3

• Information access
• Computational drug discovery
• Digital healthcare
• Information literacy
• Digital libraries
• Search analytics
• Research data management
• User interfaces and interaction
• Governance and open access
• Social media analysis
• Evaluation of information systems
• Data mining and visualisation
• Learning technologies
• Collaborative knowledge sharing

• Diverse research profile
• Research questions around
  – Organisations
  – Society
  – Technical

Now offering an MSc Data Science: http://www.sheffield.ac.uk/is/pgt/courses/data_science

Page 4

About the study

Clough, P., Sanderson, M., Tang, J., Gollins, T. and Warner, A. (2013) Examining the limits of crowdsourcing for relevance assessment, IEEE Internet Computing, 32-38.

Improving Information Finding at The National Archives

“the outputs of the work were provided in such a focused and timely way that they have been able to dovetail with our system development activities this year and inform critical decisions in designing our new resource discovery system to the benefit of the organisation and ultimately the public that use it” Tim Gollins, The National Archives, Head of Digital Preservation and Resource Discovery


Page 5

Outline

• Background
• Crowdsourcing
• Methodology
• Results
• Discussion
• Summary

Page 6

The National Archives

• The National Archives (TNA) is the UK Government’s official document repository
  – http://www.nationalarchives.gov.uk/

• Store and maintain records spanning over a thousand years in both physical and digital form

• Provide access to their extensive range of datasets via search functionalities (“search the archives”)
  – Electronic versions of paper documents
  – Catalogues describing the contents of TNA
  – Published information (e.g. London Gazette)
  – Other datasets (e.g. Your Archives wiki)

Page 7

Page 8

What TNA wanted from us

• The National Archives approached us to help with evaluating their search tools, including “search the archives”

• TNA were specifically interested in building re-usable test collections and measuring system effectiveness
  – To enable the tuning of their search algorithms/products
  – To allow comparison of different versions/evolutions of the catalogue search

• There were a number of constraints to the work, the main one being limited resources
  – We were therefore asked to think about cost-effective ways of creating evaluation resources

Page 9

Evaluating IR systems

• IR systems are ultimately used by people, for some purpose, operating in an environment

• What makes an IR system successful?
  – Whether it retrieves ‘relevant’ documents
  – How quickly it returns results
  – How well it supports user interaction
  – Whether the user is satisfied with the results
  – How easily users can use the system
  – Whether the system helps users carry out tasks
  – Whether the system impacts on the wider environment

Clough, P. and Goodale, P. (2013) Selecting Success Criteria: Experiences with an Academic Library Catalogue, In Proceedings of 4th Conference on Evaluation of Information Access Systems (CLEF 2013), 59-70.

Page 10

Evaluating IR systems

• Measurement usually carried out in controlled laboratory experiments (with or without users)
  – Standardised benchmarks (test collections)
  – Lab-based experiments with users
  – Online using operational systems

• Effectiveness, efficiency, usability and cost are related
  – e.g., if we want a particular level of effectiveness and efficiency, this will determine the cost of the system configuration

• Evaluation typically comparative (Systems A vs. B)
• Most common evaluation: retrieval effectiveness
• Evaluate during development or at the end

Page 11

Evaluation “landscape” (Kelly, 2009)

[Diagram: an evaluation spectrum ranging from “Lab-based / Simulated users / Controlled variables” to “In situ (living lab) / Real users, needs and situations / Uncontrolled variables”, with “Inform” and “Predict” linking the two ends.]

Page 12

IR test collections

• Test collections provide re-usable resources to evaluate IR systems in a controlled lab setting
  – Collection of documents
  – Set of representative queries (topics)
  – Set of relevance judgments for each topic
  – Evaluation measures (system performance)

• Test collection + measures provides a simulation of a user in an operational setting (if designed carefully)
  – Do results obtained with test collections predict user task success or performance / satisfaction with results?
  – But what about beyond the query-response paradigm?
  – How do you integrate contextual and situational factors?

‘Cranfield style’ test collection
Comparative system evaluation

Clough, P. and Sanderson, M. (2013) Evaluating the performance of information retrieval systems using test collections, Information Research, Volume 18(2), paper 582

Page 13

[Diagram: a search engine and a set of judges are combined through a function f( ) to produce evaluation results. From the tutorial on “Low-Cost Evaluation, Reliability and Reusability”, RuSSIR 2011, Evangelos Kanoulas.]

Page 14

IR test collections – practical issues

• Gathering a collection of documents
• Generating a suitable set of queries/topics
  – How do I obtain the queries/topics?
  – How many queries/topics do I need?
• Creating the relevance assessments
  – How do I gather the assessments?
  – Who should do the assessments?
  – How many assessments should be made?
  – What are the assessors expected to do?
  – What about finding missing relevant documents?
• Selecting a suitable evaluation measure
• These decisions will affect the quality of the benchmark and impact on the accuracy/usefulness of results

Page 15

Page 16

So in summary …

• Evaluation is critical for designing, developing and evolving effective IR systems

• Traditionally there has been a strong focus on measuring system effectiveness in controlled lab settings

• Cranfield methodology commonly used, where human judges manually assess documents for relevance to a range of queries

• However, gathering relevance assessments can be a real bottleneck, especially in real-life settings


Page 17

Crowdsourcing

• Crowdsourcing is the act of taking a job traditionally performed by a designated person and outsourcing it to an undefined, generally large, group of people in the form of an open call

http://www.wired.com/wired/archive/14.06/crowds.html

Page 18

• Crowdsourcing was recently made popular by the development of the Amazon Mechanical Turk (AMT) platform

• Artificial artificial intelligence
  – “For many tasks, humans outperform computers. So why not farm out computing tasks to people, not machines?”
    http://www.economist.com/node/7001738?story_id=7001738

• Mechanical Turk
  – Requester creates Human Intelligence Tasks (HITs)
  – Workers choose to complete HITs
  – Requesters assess results and pay workers
  – Currently > 200,000 workers from many countries
  – http://www.behind-the-enemy-lines.com/2010/03/new-demographics-of-mechanical-turk.html


Page 19

Example AMT task: tenthousandcents.com


Page 20

Generating relevance assessments

• Relevance assessment is time-consuming and causes a bottleneck in IR evaluation
  – Often requires the input of domain experts
  – Pooling is commonly used to form sets of documents for assessors to judge (a sketch of pool construction follows below)
  – Coverage of pools (and depth)

• More efficient approaches to judge relevance?
  – Move-to-Front (MTF) pooling
  – Interactive Search and Judge
  – Use of sampling techniques
  – Use of implicit judgments (e.g. from log data)
  – Crowdsourcing
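As a rough illustration of how a pool is formed (a minimal sketch of the general technique, not TNA's or TREC's exact procedure), the pool for a topic is simply the union of the top-k documents returned by each participating system, and only pooled documents are shown to assessors:

```python
def build_pool(runs, k=10):
    """Union of the top-k documents from each system's ranked run for one topic.

    `runs` maps a system name to its ranked list of document ids.
    Only documents in the returned pool are judged by assessors.
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

# Example: two systems, pool depth 10
runs = {"systemA": ["d3", "d7", "d1"], "systemB": ["d7", "d9", "d2"]}
print(build_pool(runs, k=10))  # {'d1', 'd2', 'd3', 'd7', 'd9'}
```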

Page 21

Crowdsourced assessments

• Crowdsourcing has been used in various natural language processing tasks (Snow et al., 2008), for example
  – Comparing the output of MT systems
  – Assessing semantic similarity of words

• Crowdsourcing has also been shown to be feasible for assessing the relevance of IR system results, for example
  – Alonso & Mizzaro (2009) compared crowdsourced judgments (using AMT) with judgments of TREC assessors
  – They showed that crowdsourcing was a reliable way of providing relevance assessments, although creating a TREC-like experiment suitable for crowdsourcing required careful design and execution

Page 22

Effects of domain expertise

• Bailey et al. (2008) compared the relevance assessments of judges who were subject experts and those who were not
  – Found that different assessments resulted, which had an effect on system effectiveness
  – Distinctions between top-performing systems were harder to discern based on assessments from the judges who were not subject experts

• Kinney et al. (2008) also investigated domain expertise and found that non-specialist assessors made significant errors
  – Compared to experts, they disagreed on the underlying meaning of queries and subsequently there were effects on calculations of system effectiveness
  – Found that the rating accuracy of generalists was improved if domain experts provided descriptions of what the user, issuing a query, was seeking

Page 23

So in summary …

• Crowdsourcing has successfully been used in various natural language processing tasks
• Crowdsourcing has been shown to be feasible for relevance assessment
• However, previous studies also showed that domain expertise can have an effect on judgments
• A topic that is less examined, however, is the limitations of a crowdsourcing approach, especially within specialised domains such as The National Archives

Question: will crowdsourced judgments for queries and documents covering a specialist domain be as accurate as, and correlate with, the judgments of a domain expert?

Page 24

Methodology

• Our study
  – Compare relevance assessments from crowdsourced workers with those from a domain expert
  – Measure impact on system effectiveness scores and relative rankings of two search engines used to provide search across the archives at TNA (System A and System B)

• Domain expert
  – Worked with subject experts across the organisation to identify answers to search queries
  – Assessor held degrees in History (BA and MA)
  – Head of Systems Development and Search at TNA for approximately 3 years, examining how people searched the TNA website
  – Assessor worked at TNA for around 6.5 years

Page 25

Experimental setup

• 48 queries selected by a member of the search quality team at TNA from analysing search logs and identifying frequent queries
  – Half “navigational” and half “informational”

• The TNA team member examined retrieved documents and from this wrote a 1-2 sentence description of the likely information need behind each query
  – This was used to help assessors better understand the user’s intent, in line with the recommendations of Kinney et al. (2008)

• The 48 queries were issued to two search engines (System A and System B) and the ten highest ranked documents retrieved and judged

• Precision at rank 10 (P@10) and Discounted Cumulative Gain measured at rank 10 (DCG@10) used to measure performance (see the sketch below)

• Workers’ scores aggregated by using average DCG and P@10 scores
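A minimal sketch of the two measures and the averaging of worker scores (the exact DCG discount used in the study is not stated on the slides, so the common log2(rank+1) form below is an assumption):

```python
import math

def precision_at_k(judgments, k=10):
    """P@k: fraction of the top-k documents judged relevant (grade > 0)."""
    top = judgments[:k]
    return sum(1 for rel in top if rel > 0) / k

def dcg_at_k(judgments, k=10):
    """DCG@k with graded relevance (0, 1, 2) and a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(judgments[:k], start=1))

# Per-query scores from several workers are averaged before comparison
worker_judgments = [[2, 1, 0, 2, 0, 0, 1, 0, 0, 0],
                    [2, 2, 0, 1, 0, 0, 0, 0, 0, 0]]
avg_p10 = sum(precision_at_k(j) for j in worker_judgments) / len(worker_judgments)
avg_dcg = sum(dcg_at_k(j) for j in worker_judgments) / len(worker_judgments)
print(round(avg_p10, 2), round(avg_dcg, 2))
```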

Page 26

Queries used in study

Informational | Navigational
"John Parmenter", lincolnshire regiment | 1901 census, medal rolls
arklow, Lockton | 1911 census, medals
assi 45, merchant navy | army records, military
aylesford kent, monks eleigh births | military records
Bradshawe, NAVY 1953 | boer war, naval records
C15, palestine | cabinet papers, prison records
churton, pannal | divorce, royal navy
Clapham Common Rail 264 | divorce records, ufo
david morgan songer | domesday, war diaries
fawcett, togoland | maps, wills
John William Baker, victorious | ship logs, medal cards ww1
knaresborough forest, wicklow | medal index cards ww2

Page 27

Example queries with descriptions

“John Parmenter” (informational)

Any entries among The National Archives data which contain the term ‘John Parmenter’. User is probably searching for sources which are useful for Family History, in particular wills or military records, especially medal rolls.

1901 census (navigational)

Any result that allows users to find out information about the 1901 Census (or search its data). Most likely preferred destination is a link to http://www.1901censusonline.com/ - an external site which actually contains the Census data.


Page 28

Design of Mechanical Turk experiment

• Each job performed by the crowdsource workers (known as a HIT, Human Intelligence Task) consisted of being shown (see the sketch after this list):
  – Query
  – Description of the query intent
  – 10 retrieved documents to be judged for relevance (output by either system A or B)
  – Four post-task questions to answer (using 5-point Likert scales)

• Graded relevance assessments were used to assess relevance
  – 0 = not relevant
  – 1 = partially relevant
  – 2 = highly relevant

• Workers were encouraged to judge multiple queries
• Ethics approval obtained!
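One hypothetical way to represent the unit of work described in the list above (the field names are illustrative, not the actual AMT template used in the study):

```python
from dataclasses import dataclass, field

# Graded relevance scale used in the study
NOT_RELEVANT, PARTIALLY_RELEVANT, HIGHLY_RELEVANT = 0, 1, 2

@dataclass
class HIT:
    """One Human Intelligence Task as presented to a worker (hypothetical structure)."""
    query: str
    intent_description: str                              # 1-2 sentence description of the likely need
    documents: list                                      # the 10 results from system A or system B
    judgments: dict = field(default_factory=dict)        # doc id -> grade in {0, 1, 2}
    likert_answers: list = field(default_factory=list)   # four post-task questions, each 1-5
```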

Page 29

How it appeared on Mechanical Turk


Page 30

Design of Mechanical Turk experiment

• We aimed to gather 10 sets of judgments per query-system combination, resulting in a total of 960 HITs (48 x 2 x 10)
  – Workers were offered 4¢ per completed HIT
  – Total cost of the experiment was $43.20 (including admin fees; a budgeting sketch follows below)
  – Experiment was run over two weeks

• Data collected from 73 unique workers who produced 924 HITs (96.3% of the total available)
  – The ten most active workers completed approximately 80% of the HITs

• No gold standard data was used in the experiments; data was manually checked for noise, which resulted in 91 HITs being eliminated from one worker
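A back-of-the-envelope budgeting sketch; the 12.5% fee rate below is an assumption chosen only because it is consistent with the reported $43.20 total, not a figure taken from the study:

```python
n_queries, n_systems, judgments_per_pair = 48, 2, 10
hits = n_queries * n_systems * judgments_per_pair     # 960 HITs planned
reward_per_hit = 0.04                                 # 4 cents per HIT
assumed_fee_rate = 0.125                              # assumed platform fee, not from the slides
total_cost = hits * reward_per_hit * (1 + assumed_fee_rate)
print(hits, round(total_cost, 2))                     # 960, 43.2
```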

Page 31

Page 32

Questionnaire responses

Queries | Q1 – difficulty (1=V. difficult; 5=V. easy) | Q2 – familiarity (1=V. unfamiliar; 5=V. familiar) | Q3 – confidence (1=Not at all confident; 5=V. confident) | Q4 – satisfaction (1=V. unsatisfied; 5=V. satisfied)

Expert – All | 4.36 | 4.34 | 4.25 | 3.25
Expert – Informational | 4.10 | 4.13 | 4.00 | 3.04
Expert – Navigational | 4.63 | 4.56 | 4.50 | 3.46
Expert – System A | 4.54 | 4.44 | 4.44 | 3.90
Expert – System B | 4.19 | 4.25 | 4.06 | 2.60

Crowd-sourced workers – All | 3.47 | 3.54 | 4.12 | 4.13
Crowd-sourced workers – Informational | 3.48 | 3.43 | 4.04 | 4.05
Crowd-sourced workers – Navigational | 3.47 | 3.65 | 4.20 | 4.21
Crowd-sourced workers – System A | 3.42 | 3.57 | 4.18 | 4.18
Crowd-sourced workers – System B | 3.52 | 3.51 | 4.05 | 4.08

Page 33

Comparing effectiveness of System A vs. System B

Judgments | P@10 (All) | DCG (All, N=48) | DCG (Informational, N=24) | DCG (Navigational, N=24)
Expert – System A | 0.81 | 7.43 | 6.59 | 8.26
Expert – System B | 0.51 | 4.39 | 4.66 | 4.11
Expert – Average | 0.66 | 5.91 | 5.63 | 6.19
Crowd-sourced workers – System A | 0.77 | 6.40 | 5.60 | 7.21
Crowd-sourced workers – System B | 0.63 | 5.32 | 4.93 | 5.71
Crowd-sourced workers – Average | 0.70 | 5.86 | 5.27 | 6.46

With both sets of judgments, System A is measured to be statistically significantly better than System B (using a paired t-test) and, although the absolute scores differ, the relative ranking of the systems remains stable.
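A minimal sketch of such a paired comparison, using hypothetical per-query scores rather than the study's data:

```python
from scipy import stats

# Hypothetical per-query DCG@10 scores for the same queries under each system
# (in the study there were 48 paired queries)
system_a = [7.1, 8.3, 6.0, 9.2, 5.4, 7.8]
system_b = [4.2, 5.1, 3.9, 6.0, 4.4, 4.7]

t_stat, p_value = stats.ttest_rel(system_a, system_b)  # paired t-test across queries
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```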


Page 34

DCG results across all queries

The crowdsource workers rate system B as far better than the expert does, causing the large differences in absolute scores: crowdsource workers appear unable to detect poor performance (see also the questionnaire results).

[Charts: DCG per query, shown separately for informational and navigational queries]


Page 35

Comparing correlations between assessors

• Judgments between the crowdsourced workers and the expert are more interchangeable for system B than A, despite resulting differences in absolute scores

Queries | DCG | P@10
System A – All (N=48) | 0.492** | 0.285*
System A – Informational | 0.323 | -0.029
System A – Navigational | 0.485* | 0.467*
System B – All (N=48) | 0.601** | 0.595**
System B – Informational | 0.563** | 0.523**
System B – Navigational | 0.772** | 0.786**
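A sketch of how per-query correlations like these can be computed; the slides do not state which coefficient was used, so Pearson is an assumption here, with Spearman as the rank-based alternative:

```python
from scipy import stats

# Hypothetical per-query DCG scores from the expert and the (averaged) workers
expert_dcg = [7.1, 4.0, 8.3, 2.5, 6.2, 5.0]
worker_dcg = [6.4, 4.4, 7.0, 3.1, 5.8, 5.5]

r, p = stats.pearsonr(expert_dcg, worker_dcg)        # assumed coefficient
rho, p_s = stats.spearmanr(expert_dcg, worker_dcg)   # rank-based alternative
print(f"Pearson r = {r:.3f} (p = {p:.3f}), Spearman rho = {rho:.3f}")
```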

Page 36

Comparing judgments I

• Navigational queries correlate better than informational

• The lower correlation between expert and crowdsourced workers for system A, particularly for informational queries, suggests the results of a higher-quality search engine are more difficult to assess using crowdsourcing

[Scatter plots: Expert DCG score against MTurker DCG score per query (informational and navigational), for System A (the better system) and for System B]

Page 37

Comparing judgments II

• Multiple crowdsourced judgments aggregated to compute average score

• Compute measure of inter-rater agreement based on standard deviation across P@10 scores for all crowdsource workers

• Strong positive correlation indicating that the greater the variation between the crowdsourced judgments, the greater the difference between the average crowdsourced worker score and the expert

• Impact: could use this to predict how well workers will agree with expert judge
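A sketch of that computation on hypothetical per-query P@10 scores (the correlation coefficient is again an assumption, since the slides do not name one):

```python
import statistics
from scipy import stats

# Hypothetical per-query P@10 scores: the expert plus each worker who judged the query
expert_p10 = [0.8, 0.5, 0.9, 0.4, 0.7]
worker_p10 = [[0.7, 0.8, 0.6], [0.2, 0.6, 0.4], [0.9, 0.8, 0.9],
              [0.1, 0.5, 0.3], [0.6, 0.8, 0.7]]

spread = [statistics.stdev(w) for w in worker_p10]                           # worker disagreement
gap = [abs(e - statistics.mean(w)) for e, w in zip(expert_p10, worker_p10)]  # distance from expert

r, p = stats.pearsonr(spread, gap)
print(f"correlation between worker disagreement and distance from expert: r = {r:.2f}")
```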


Page 38

Difference between P@10 scores for System B (worse system)

[Scatter plot: difference between Expert and Turker P@10 scores against the standard deviation of Turker P@10 scores, for informational and navigational queries]

Page 39

Discussion I

• Many organisations need reliable and repeatable methodologies for evaluating search services
  – Crowdsourcing potentially lowers the cost and offers an efficient way of collecting relevance assessments

• Cost of relevance assessments for our study (hourly rates are worked out below)
  – Crowdsource workers: cost $43.00 for 45 hrs 13 mins of assessor effort purchased on AMT
  – Expert assessor: cost $106.02 for 3 hrs 5 mins of work

• Our results suggest that crowdsourced assessments and those generated by an expert produce similar results when ranking systems A and B (A was better)
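Converting the reported figures to approximate hourly rates (simple arithmetic on the numbers above, rounded):

```python
crowd_cost, crowd_hours = 43.00, 45 + 13 / 60      # $43.00 for 45 hrs 13 mins
expert_cost, expert_hours = 106.02, 3 + 5 / 60     # $106.02 for 3 hrs 5 mins

print(f"crowd:  ${crowd_cost / crowd_hours:.2f} per hour of assessor effort")    # ≈ $0.95
print(f"expert: ${expert_cost / expert_hours:.2f} per hour of assessor effort")  # ≈ $34.38
```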


Page 40

Discussion II

• However, limits were found to using crowdsource workers for assessing relevance within the TNA setting
  – Correlations between the crowdsourced workers and the expert assessor are lower for certain kinds of queries
  – In our case informational queries, in which the assessor is judging a more subject-oriented type of search, seem to be more difficult for assessors on a system that performs better (system A)
  – Although absolute scores for system B were higher based on crowdsourced worker judgments than the expert's, it seems likely that results from a poorer-performing system were easier to judge

Page 41

Summary I

• Results indicate that crowdsource workers were good enough for the job (in hand) of correctly ranking the two search engines based on retrieval effectiveness

• Absolute scores differ but relative rankings remain stable across assessor groups (Cleverdon, 1970; Lesk & Salton, 1968; Voorhees, 1998)

• Crowdsourcing seems a viable method of gathering relevance assessments (Alonso & Mizzaro, 2008; Carvalho et al., 2011; Hosseini et al., 2012)

• But there are disagreements between assessors linked with domain expertise (Bailey et al., 2008; Kinney et al., 2008)
  – Informational vs. navigational queries
  – For the better-performing system more than the poorer one

Page 42

Summary II

• The National Archives are very interested in the results due to constraints on resources and the need to evaluate their IR system

• Crowdsourcing is a practical and efficient approach to gathering relevance assessments

• However, they are unlikely to use an unknown crowd, but rather known user groups with more expertise and domain knowledge
  – For example, volunteer groups and enthusiasts of the National Archives and their content

Page 43

Future work

• Extensive dataset which would provide the basis for further investigations
  – “Failure analysis” on specific queries and assessors (e.g. look for assessors whose judgments better match the expert)
  – Use current data as training data to identify incorrect judgments or poor assessors (e.g. filtering results by levels of expertise)
  – Experiment with different approaches to aggregating judgments from crowdsource workers (at present we used the average of system effectiveness scores; a voting approach, sketched after this list, is one alternative)

• Experiment with alternative user groups with varying levels of domain expertise
  – Maybe use crowdsourcing to judge some queries; an expert for others

• Experiment with larger and more varied sets of queries (esp. rarer queries)
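A minimal sketch of the voting alternative mentioned in the list above (a simple per-document majority vote; not something evaluated in the study, and the tie-breaking rule is an arbitrary choice):

```python
from collections import Counter

def majority_vote(grades):
    """Aggregate several workers' graded judgments (0/1/2) for one document.

    Ties are broken in favour of the lower grade; that choice is arbitrary.
    """
    counts = Counter(grades)
    best = max(counts.values())
    return min(g for g, c in counts.items() if c == best)

print(majority_vote([2, 2, 1, 0, 2]))  # 2
print(majority_vote([1, 0, 1, 0]))     # 0 (tie broken toward the lower grade)
```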

Page 44

Thanks for listening

Any questions?


Acknowledgements for funding and support