
A Little Competition Never Hurt Anyone’s Relevance Assessments

Yuan Jin, Mark J. Carman
Faculty of Information Technology
Monash University
{yuan.jin,mark.carman}@monash.edu

Lexing Xie
Research School of Computer Science
Australian National University
[email protected]

Abstract

This paper investigates the effect of real-time performance feedback and competition on the accuracy of crowd-workers for a document relevance assessment task. Through a series of controlled tests, we show that displaying a leaderboard to crowd-workers can motivate them to improve their performance, provided that a bonus is offered to the best performing workers. This effect is observed even when test questions are used to enforce quality control during task completion.

1 Introduction

Crowd-sourcing involves enlisting the skills and expertise of many individuals (usually over the Internet) to achieve tasks that require human-level intelligence. Paid crowd-sourcing platforms, such as Amazon's Mechanical Turk1 and CrowdFlower2, allow for the low-cost outsourcing of labelling tasks such as judging document relevance to Web search queries.

In this paper we investigate paid crowd-sourcing applied to the collection of relevance judgements for Information Retrieval tasks, and look to better understand the role performance feedback and competition can play in motivating crowd-workers to provide high-quality judgements. More specifically, we seek to systematically answer the following research questions:

Copyright © by the paper's authors. Copying permitted for private and academic purposes.

In: F. Hopfgartner, G. Kazai, U. Kruschwitz, and M. Meder (eds.): Proceedings of the GamifIR 2016 Workshop, Pisa, Italy, 21 July 2016, published at http://ceur-ws.org

1 https://www.mturk.com/mturk/welcome
2 https://www.crowdflower.com/

• Does providing real-time feedback to crowd-workers regarding their relevance judging accuracy affect their average performance? And if so, does their performance improve or worsen?

• Does the competition between crowd-workers that results from providing a leaderboard to them affect their performance? And if so, does it make the performance of workers better, worse or both (i.e. more varied)?

• Do crowd-workers need to be incentivised by financial rewards (i.e. bonuses/prizes to the best performers) to improve their performance?

• If crowd-workers are already motivated to perform well through competition, are test questions (i.e. the vetting of workers using pre-task quizzes and in-task assessments) still necessary in order to motivate them to provide quality judgements?

Given these research questions, the main contributions of the paper can be summarised as follows:

• We show that it is straightforward to augment a standard crowdsourcing platform with basic real-time feedback and leaderboard functionality while preserving the privacy and anonymity of the crowd-workers.

• We introduce an experimental design based on standard A/B/n testing [KLSH09] that assigns crowd-workers directly to control and treatment groups. By further partitioning documents across the experiments we look to prevent contamination both within and between experiments.

• We demonstrate via controlled experiments that providing real-time feedback to crowd-workers via a leaderboard, while offering them a financial reward for high achievement, likely motivates them to provide better relevance judgements than they would otherwise (i.e. if no leaderboard were present).

• We demonstrate that even when control/test questions are used to vet and remove low-performing crowd-workers, the positive effects of real-time feedback and competition are still present.

The paper is structured as follows. We first discuss related work in Section 2, followed by a description of the experimental setup and three controlled experiments in Section 3. We draw final conclusions in Section 4.

2 Related Work

Alonso and Mizzaro [AM09] demonstrated in 2009 that crowd-workers on a popular crowd-sourcing platform can be effective for collecting relevance judgements for Information Retrieval evaluations, with the workers in some cases being as precise as TREC relevance assessors. Since then there has been a great deal of work in the Information Retrieval community investigating the efficacy of crowd-sourcing for collecting relevance judgements, with guides being developed [ABY11] and three years of TREC competitions organised [SKL12].

Grady and Lease [GL10] investigated the importance of user-interface features and their effect on the quality of relevance judgements collected. In particular, they investigated human factors such as whether offering to pay a bonus for well justified responses (they asked users to explain their judgements) resulted in higher accuracy judgements. They found that accuracy did indeed markedly increase as a result of the offer. (We find similar effects and control for them in our experiments.)

A recent survey of gamification ideas (such as providing points and leaderboards to users) applied to Information Retrieval tasks was provided by Muntean and Nardini [MN15]. Of particular note is the work by Eickhoff et al. [EHdVS12], who developed a game-based interface for collecting relevance judgements in an attempt to improve the quality and/or reduce the cost of crowdsourced relevance judgements. Despite similar goals to ours, their game-based interface was not integrated within a standard crowd-sourcing environment, and relied on quite different mechanisms for establishing which documents were relevant to which queries, thus making the work not directly comparable to ours.

Harris [Har14] also investigated gamification ideas with respect to relevance judgement, and in particular investigated the difference between users' own perception of relevance and their perception of the "average user's" opinion of what constitutes relevance. Harris made use of a leaderboard at the start and end of the game, but did not provide real-time feedback about performance via the leaderboard directly to individual crowd-workers as our work does.

In addition to developing practical expertise in the collection of crowdsourced relevance judgements, researchers have developed specialised algorithms for identifying the most likely correct judgement (or a posterior over relevance judgements) for a query-document pair based on a conflicting set of judgements from multiple users (see for example [MPA15]). Other work has looked at making direct use of crowd-sourced relevance judgements in order to train rank-learning systems (see for example [MWA+13]).

Despite all of this previous research, we are unaware of any work that has specifically tested the conjecture that providing real-time feedback to crowd-workers regarding their relevance judging performance can improve that performance. Moreover, to the best of our knowledge, the use of leaderboards to motivate crowd-workers to provide better relevance judgements has not been previously investigated.

3 Experiments

In this section, we detail the three experiments performed to address the research questions specified in Section 1.

3.1 Dataset

We made use of the “assigned-judged.balanced” data subset3 from the TREC 2011 crowd-sourcing track4, with 750 expert judgements made by NIST assessors for 60 queries, and the ClueWeb09 collection5 from which the documents judged for relevance were drawn.

3.2 Experiment Design

In each experiment, we collected relevance judgements for a crowd-sourcing task undertaken by workers on the CrowdFlower platform. Within each task, we performed A/B/n testing of control and treatment groups. We followed standard techniques for online experimentation [KLSH09], dividing workers as equally as possible across the different control and treatment groups in each task. Moreover, we randomly divided the chosen data subset into three disjoint sets of document-query pairs in order to prevent crowd-workers who might partake in multiple experiments from ever judging the same document-query pair twice. Within an experiment, the same set of query-document pairs was shown to all crowd-workers across (and within) the control and treatment groups in order to remove variability due to (query-document pair) judgement difficulty.

3 This data subset is balanced across three relevance levels (i.e. highly relevant, relevant and non-relevant).

4 https://sites.google.com/site/treccrowd/2011
5 http://lemurproject.org/clueweb09/
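To make the contamination-avoidance step concrete, the following sketch (our own illustration, not the authors' code; the pair identifiers and the seed are hypothetical) shows one way to split the query-document pairs into three disjoint pools, one per experiment, so that a worker who joins several experiments can never see the same pair twice.

```python
# Illustrative sketch only: partition query-document pairs into disjoint,
# roughly equal-sized pools, one per experiment. Pair IDs and seed are
# hypothetical; the paper does not specify how the split was implemented.
import random

def partition_pairs(query_doc_pairs, n_experiments=3, seed=42):
    """Shuffle once, then deal pairs round-robin into disjoint pools."""
    pairs = list(query_doc_pairs)
    random.Random(seed).shuffle(pairs)
    return [pairs[i::n_experiments] for i in range(n_experiments)]

# Example usage with made-up identifiers:
pools = partition_pairs([("q01", "doc_a"), ("q01", "doc_b"), ("q02", "doc_c"),
                         ("q03", "doc_d"), ("q03", "doc_e"), ("q04", "doc_f")])
assert not set(pools[0]) & set(pools[1])   # pools are disjoint
```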

Figure 1: Architecture of the experiment environment. Crowd-workers are randomly assigned to a particular group after joining the task on CrowdFlower, and see the same version of the interface for each subsequent interaction from their browsers. The server updates their performance statistics (or rank positions) once each page of judgements is completed.

The CrowdFlower platform is designed for ease of use in creating crowd-sourcing tasks, but not for A/B/n testing within each task. Thus, in order to perform controlled experiments, we employed the experiment architecture shown in Figure 1. Crowd-workers first join our tasks hosted on the CrowdFlower platform, causing a task page to load in their browser. At this point, client-side Javascript sends a request for additional page content to the A/B/n testing server. The server then randomly allocates each individual to one of the control and treatment groups for the duration of the experiment6, and returns content to be inserted into the task page.

The inserted content provided different information to crowd-workers in different experimental groups. For example, the task page shown in Figure 2 contains the performance statistic (e.g. labelling accuracy) for the worker and his/her own view of the shared leaderboard, while the task page shown in Figure 3 (for a different group) contains no inserted content. The performance statistics for each crowd-worker are updated by the server based on the ground-truth relevance judgements (from NIST assessors) at the completion of each task page (consisting of 5 relevance judgements). The crowd-worker's view of the leaderboard is updated each time they load or refresh the task page. The leaderboard shown for the various experiments contained: (i) the rank of the crowd-worker, (ii) anonymous usernames for each of the crowd-workers, (iii) the score for each worker (computed using metrics to be specified in the following sections), and (iv) the change in the rank of each worker since the previous refresh.

6 We used random assignment rather than sequential assignment to prevent biases due to the order in which individuals join the tasks.
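A minimal sketch of the server-side allocation described above, assuming a simple in-memory store (the paper does not describe the implementation): each worker is assigned a group at random on first contact, the draw is restricted to the currently smallest groups so that workers are divided as equally as possible, and the assignment is remembered for every later page load.

```python
# Hedged sketch of persistent A/B/n assignment; not the authors' server code.
import random
from collections import Counter

GROUPS = ["control", "treatment_1", "treatment_2", "treatment_3"]
_assignments = {}                          # worker_id -> group
_counts = Counter({g: 0 for g in GROUPS})  # workers assigned per group so far
_rng = random.Random(0)

def group_for(worker_id):
    """Assign a group at random on first contact and reuse it afterwards.

    The random draw is restricted to the groups with the fewest workers so
    that group sizes stay as equal as possible.
    """
    if worker_id not in _assignments:
        fewest = min(_counts.values())
        group = _rng.choice([g for g in GROUPS if _counts[g] == fewest])
        _assignments[worker_id] = group
        _counts[group] += 1
    return _assignments[worker_id]
```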

Figure 2: A screenshot of the interface shown to crowd-workers assigned to Treatment group 2 for Experiment 1, containing a leaderboard ranking contributors based on their labelling accuracy as a percentage.

Figure 3: A screenshot of the interface shown to crowd-workers assigned to the control group for Experiment 1, which contained no additional information about the workers' performance.

Crowd-workers were free to quit a task at any point, sometimes even without making any judgement, meaning that the number of workers can differ across different groups due to both random assignment and self-removal.
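As an illustration of the leaderboard payload described above (field names are ours; the alias index stands for the order in which a worker was assigned to the group), a sketch of how the rows could be assembled:

```python
# Sketch only: build leaderboard rows with rank, anonymous alias, score and
# change in rank since the worker's previous refresh.
def leaderboard_rows(scores, previous_ranks):
    """scores: alias_index -> score; previous_ranks: alias_index -> old rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    rows = []
    for rank, idx in enumerate(ordered, start=1):
        rows.append({
            "rank": rank,
            "alias": f"Contributor {idx}",           # no CrowdFlower usernames leaked
            "score": round(scores[idx], 3),
            "change": previous_ranks.get(idx, rank) - rank,  # positive = moved up
        })
    return rows
```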

We adopted the setup used in the TREC 2011 crowd-sourcing track, where a worker judged the relevance of 5 documents per query. The 5 documents per query were embedded as images on one page in random order for all the tasks except the one which will be specified in Section 3.5.

3.3 Experiment 1: Real-time Feedback

In the first experiment, we investigated whether providing real-time feedback to crowd-workers affected their performance. To do this we randomly assigned crowd-workers to one of four groups:

• Control Group: No performance feedback was provided to the workers.

• Treatment 1: The workers were told their current estimated accuracy and the current average estimated accuracy of all the workers in the same group. (The latter information was provided for anchoring purposes, so that the workers knew whether their accuracy scores were relatively high or low.)

• Treatment 2: A leaderboard was provided to the workers showing their current ranking in the group, scored by their estimated accuracy. On the board, workers were referred to as "Contributor i", where i denoted the order in which they were assigned to this group. Thus no personally identifying information (such as CrowdFlower usernames) was leaked between contributors. Moreover, workers got to see the changes in their ranking every time they reloaded current pages or loaded new ones.

• Treatment 3: The scores shown for the workers on the leaderboard were modified to be the product of their estimated accuracy and the number of responses they had made. The idea was to encourage workers to keep annotating in order to rank higher on the board.

For this experiment, each task consisted of labelling 5 documents for the same query as {highly relevant, relevant, non-relevant}. Crowd-workers could complete a maximum of 20 tasks each (i.e. label 100 documents) and each task was made available to 40 crowd-workers. The documents used to build the tasks were randomly sampled from the balanced data subset of the TREC 2011 dataset.

In all experiments, we used the term "estimated accuracy", even though we calculated the accuracy based on the ground-truth relevance judgements, in order to be consistent with the general case where the ground-truth judgements are unknown and need to be estimated. Moreover, we applied Dirichlet smoothing to the accuracy estimate7 $\hat{\mu}_i$:

$$\hat{\mu}_i \;=\; \frac{\sum_{j=1}^{n_i} \mathbb{1}(r_{ij} = y_j) \,+\, \alpha \cdot \tfrac{1}{3}}{n_i + \alpha} \qquad (1)$$

where $n_i$ is the number of responses from Worker $i$, $r_{ij}$ denotes the individual's response to question $j$, $y_j$ is the ground-truth response, and $\alpha$ is a smoothing parameter set to 5 so that the number of pseudo-counts equals the number of questions per page.

7 More formally, we employed an estimate of the posterior mean assuming a Binomial likelihood (over correct/incorrect answers) and a Beta prior with concentration parameter $\alpha$ and prior mean (probability of a correct answer) of 1/3, due to the balanced classes used.
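Equation (1) translates directly into code. The sketch below (function names are ours, not the paper's) computes the smoothed accuracy with alpha = 5 and prior mean 1/3, together with the volume-scaled variant used to rank workers in Treatment 3.

```python
# Sketch of the Dirichlet-smoothed accuracy of Equation (1) and the two
# leaderboard scores used in Experiment 1; names are illustrative.
def smoothed_accuracy(responses, truth, alpha=5.0, prior_mean=1/3):
    """responses, truth: equal-length lists of labels for one worker."""
    n = len(responses)
    correct = sum(r == y for r, y in zip(responses, truth))
    return (correct + alpha * prior_mean) / (n + alpha)

def leaderboard_score(responses, truth, scale_by_volume=False):
    """Treatment 2 ranks by smoothed accuracy; Treatment 3 multiplies by volume."""
    acc = smoothed_accuracy(responses, truth)
    return acc * len(responses) if scale_by_volume else acc
```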

Table 1: Results across the four groups for Experiment 1. Accuracy is given as both Micro and Macro averages, with the latter being aggregated at the worker level. Crowd-workers who judged less than 50 documents were excluded from the analysis.

Group         # Workers   # Judgements   Accuracy (Micro)   Accuracy (Macro)
Control           44          3874            45.04%             45.39%
Treatment 1       40          3449            45.26%             43.20%
Treatment 2       48          3947            47.76%             47.86%
Treatment 3       39          3613            46.39%             47.06%

Figure 4: A boxplot showing the distribution of Accuracy values across the crowd-workers for the different groups (Control, Treatment 1, Treatment 2, Treatment 3) in Experiment 1.

The main reason for smoothing was to prevent crowd-workers from providing a small number of judgements on which they performed unusually well (e.g. judging all 5 documents correctly on the first page) and then quitting the task to win the competition.

Some contributors might have provided fewer relevance judgements than others because they (i) arrived late to the task when few pages were left or (ii) gave up early on. To prevent problems due to poor estimation of worker accuracy and/or weaker effects due to receiving feedback for shorter periods of time, data from workers who had provided less than 50 judgements were discarded across all groups (and were not included in the following analysis).

Table 1 shows the experimental results across the four groups, measured in terms of both micro- and macro-averaged Accuracy8. The former is averaged equally across all judgements and the latter equally across all crowd-workers.

8 Accuracy was calculated treating the graded relevance levels {highly relevant, relevant, non-relevant} as separate classes. Thus the assignment of the label highly relevant to a document with true label relevant is considered incorrect. We repeated the analysis using Accuracy computed over binary relevance judgements, i.e. where the labels highly relevant and relevant were collapsed into the same class, finding similar results.

Due to the differing levels of ability/accuracy of the crowd-workers, the macro-average is the main measure of interest for determining changes in worker accuracy across the groups.
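For clarity, a small sketch of the two aggregates (the input format is assumed, not taken from the paper): micro-accuracy pools all judgements together, while macro-accuracy averages the per-worker accuracies.

```python
# Illustrative computation of micro- and macro-averaged accuracy.
def micro_macro_accuracy(judgements):
    """judgements: iterable of (worker_id, predicted_label, true_label)."""
    per_worker = {}                       # worker_id -> (correct, total)
    for worker, pred, truth in judgements:
        correct, total = per_worker.get(worker, (0, 0))
        per_worker[worker] = (correct + (pred == truth), total + 1)
    micro = (sum(c for c, _ in per_worker.values())
             / sum(t for _, t in per_worker.values()))
    macro = sum(c / t for c, t in per_worker.values()) / len(per_worker)
    return micro, macro
```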

The boxplot in Figure 4 shows the distribution (median and inter-quartile ranges) of performance for workers in the various groups. We see that the spread of values is similar across the four groups.

To determine whether the differences in mean performance across the groups were significant at the worker level, we employed pairwise t-tests (with non-pooled standard deviations and a Bonferroni correction for multiple comparisons). Given these test procedures, we did not find any significant difference between the groups9, indicating that the effect size for the two treatments is small at best, i.e. there is little (if any) effect on performance that results from simply providing a leaderboard to crowd-workers, and experiments with larger numbers of workers would be required to determine the size of the effect.
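The test procedure can be reconstructed roughly as follows, a sketch under stated assumptions: Welch's two-sample t-test for the non-pooled standard deviations, and a Bonferroni factor equal to the number of group pairs. Group names and data are placeholders, and this is not the authors' analysis script.

```python
# Hedged reconstruction of the pairwise significance testing.
from itertools import combinations
from scipy.stats import ttest_ind

def pairwise_welch_bonferroni(group_accuracies, alpha=0.05):
    """group_accuracies: dict of group name -> list of per-worker accuracies."""
    pairs = list(combinations(group_accuracies, 2))
    results = {}
    for a, b in pairs:
        _, p = ttest_ind(group_accuracies[a], group_accuracies[b],
                         equal_var=False)          # Welch's t-test
        p_corrected = min(1.0, p * len(pairs))     # Bonferroni correction
        results[(a, b)] = (p, p_corrected, p_corrected < alpha)
    return results
```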

We also note from Table 1 that the score used to rank crowd-workers in Treatment 3 (the estimated number of correct judgements) didn't appear to work as well as the score used in Treatment 2 (the estimated accuracy), in terms of both micro- and macro-averaged Accuracy. We conjecture that the reason for this might have been that new workers started very low on the leaderboard in the former case, and it took them a long time to rise up amongst the leaders, which caused apathy towards higher rankings.

3.4 Experiment 2: Adding a Bonus

Given the results of Experiment 1, we believed the crowd-workers needed to be further motivated, and therefore moved to investigate whether incentivising the workers by offering to pay them a bonus at the end of the task affected their performance and, if so, whether it caused them to perform better or worse.

More specifically, a $1 bonus was awarded to those who ended up within the top 10 of the leaderboard. We followed the same setup used in Experiment 1, except that the number of responses collected for each document-query pair for all the groups was now 30. The control and treatment groups involved in this experiment are listed as follows:


9 The smallest P-value observed (before the Bonferroni correction) was 0.19, between Treatment 2 and the Control group, so not significant at the 0.05 level even before correction for multiple comparisons.

Figure 5: A screenshot of the interface shown to crowd-workers assigned to Treatment group 2 in Experiment 2.

• Control Group 1: Neither performance information nor a bonus was given to the workers assigned to this group.

• Control Group 2: No performance information was provided to the workers while they were performing their annotation, but they were informed that the top 10 workers would be paid a $1 bonus at the completion of the task (once all the workers had submitted their responses). Workers were also informed that their performance would be judged according to their estimated number of correct responses10.

• Treatment 1: The configuration was the same as for Control group 2, except that a leaderboard was provided to the workers in the group showing their rankings in terms of the Dirichlet-smoothed accuracy estimate.

• Treatment 2: The configuration was the same as for Treatment group 1, except that the scores shown on the leaderboard were the estimated numbers of correct responses (as defined in Control Group 2) of the workers.

The results for the second experiment are shown in Table 2, with distributions over worker accuracies shown in Figure 6. Of note in the figure is the smaller interquartile range for Treatment group 2 with respect to the other groups (or the previous experiment). This may indicate that crowd-workers tend to perform more consistently when competing for bonuses on a leaderboard.

Treatment group 2 exhibits approximately 3% higher accuracy than the other groups for this experiment. However, pairwise t-test results show no significant difference between the various groups at the 0.05 level after a Bonferroni correction11. The smallest P-value is 0.14, between Treatment 2 and Control 1.

10 The estimated number of correct responses from Worker i is the product of the individual's Dirichlet-smoothed accuracy estimate and the number of responses he/she has made.

11 Were it not for the correction for multiple comparisons, significant differences would have been claimed. Moreover, a one-way ANOVA rejects the null hypothesis that the group means are equal with a P-value of 0.019.

Table 2: Results for Experiment 2, where a bonus was offered to participants in Control group 2 and Treatment groups 1 and 2.

Group         # Workers   # Judgements   Accuracy (Micro)   Accuracy (Macro)
Control 1         26          2419            41.71%             42.19%
Control 2         28          2624            42.11%             42.59%
Treatment 1       27          2665            43.83%             43.84%
Treatment 2       33          3139            46.54%             46.85%

Figure 6: A boxplot of Accuracy across the different groups in Experiment 2, where a $1 bonus was paid to the top 10 contributors in Control group 2 and Treatment groups 1 and 2 at the end of the task.

These values indicate that significant differences would likely be observed in an experiment with a larger number of crowd-workers.

3.5 Experiment 3: Control Questions

Control/test questions are often used in crowd-sourcing systems to check whether crowd-workers are (i) qualified for a certain task and (ii) constantly motivated to perform at a reasonable level throughout the task (i.e. not giving random answers or the same answer to all the questions12).

In CrowdFlower, conditions (i) and (ii) are implemented respectively as a quiz that workers must pass in order to embark on a task, and as task pages each of which contains one test question (amongst the other, non-test ones).

In Experiment 3, we investigated what effect introducing such control/test questions would have on crowd-workers' performance at judging relevance on CrowdFlower, and whether the real-time feedback and competition functionality provided by the leaderboard were still useful for motivating the workers to annotate more and better, given the level of quality control provided by the test questions.


12 This behaviour was observed in Experiment 1 for two workers in the Control group.

Table 3: Results for Experiment 3, where test questions were used (in both the control and the treatment groups) to vet and remove crowd-workers based on their corresponding accuracy.

Group       # Workers   # Judgements   Accuracy (Micro)   Accuracy (Macro)
Control         40          3495            52.85%             52.53%
Treatment       45          4070            54.84%             54.82%


More specifically, for each group in Experiment 3, we tried to collect 50 responses for each of 100 task questions, differing from those used in Experiments 1 and 2. The task was organised to comprise one quiz page, which contained 5 test questions (not included in the task questions), and 20 task pages, each of which contained one test question (included in the task questions) randomly inserted by CrowdFlower and four non-test questions randomised by us to appear on the same page. As a result, the 5 documents per page in Experiment 3 corresponded to different queries. There were in total 25 test questions (5 for the quiz and 20 for the task pages) and 80 non-test questions in this experiment. Thus a qualified crowd-worker could judge a maximum of 105 documents (5 in the quiz and 100 in the task itself). Moreover, we set the minimum accuracy on the test questions to 0.6, higher than the average Macro Accuracy (around 0.45) achieved by the groups in the previous experiments, so as to allow the test questions to affect crowd-workers' annotation processes, while not so high that workers would be expelled from the task early on.
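A hedged sketch of the quality-control logic just described, assuming (as we read the setup) that the same 0.6 threshold governs both the entry quiz and the running accuracy on in-task test questions; the bookkeeping below is ours, not CrowdFlower's implementation.

```python
# Sketch only: vetting of workers via test questions in Experiment 3.
MIN_TEST_ACCURACY = 0.6   # threshold chosen in the paper

def accuracy(answers, truth):
    return sum(a == t for a, t in zip(answers, truth)) / len(truth)

def passes_quiz(quiz_answers, quiz_truth):
    """Quiz page: 5 test questions that must be answered well enough to start."""
    return accuracy(quiz_answers, quiz_truth) >= MIN_TEST_ACCURACY

def still_qualified(test_answers_so_far, test_truth_so_far):
    """Checked after each task page, which hides one test question among five."""
    return accuracy(test_answers_so_far, test_truth_so_far) >= MIN_TEST_ACCURACY
```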

Only two groups were investigated in this experiment:

• Control Group: Test questions were used to select crowd-workers based on their corresponding accuracy, which in turn was provided back to the workers by CrowdFlower. No other performance information was provided, but workers were informed that the top 10 would be paid a $1 bonus at the completion of the task.

• Treatment 1: The configuration was the same as for the Control group, except that a leaderboard was provided to the workers showing their rankings in the group in terms of the estimated number of correct responses across all the documents judged (i.e. both test and non-test questions).

The results of the third experiment are shown in Table 3. We note that the overall accuracy across both the control and treatment groups is much higher than it was for the previous experiments, as was to be expected, given that poorly performing crowd-workers (as judged by their accuracy on the test questions) were prevented from further relevance judging and their previous judgements removed from the analysis (if they had managed to judge fewer than 50 documents prior to being disqualified).

Figure 7: A boxplot of Accuracy across the different groups in Experiment 3, where test questions were used to guarantee a certain level of relevance judging accuracy.


The spread of worker performance shown in Figure 7 is similar for the control and treatment groups, with the treatment group showing higher median (and mean) performance across crowd-workers, which is consistent with the hypothesis that providing a leaderboard motivates improved judging performance. However, the difference between the means was not found to be statistically significant at the 0.05 level (P-value of 0.071), indicating that a larger experiment was required. We therefore repeated the experiment three weeks later13, doubling the sample size of crowd-workers involved. For the repeated experiment, the average accuracy was 52.9% for the 105 crowd-workers in the treatment group versus 51.2% for the 92 workers in the control group14, and a significant difference in crowd-worker accuracy across the groups was observed, with a P-value of 0.039.

13 We waited to repeat the experiment in order to reduce the chance that any crowd-workers from the previous experiment would participate in the repeated experiment.

14 The fact that the macro-accuracy values (for both control and treatment groups) were lower for the repeated experiment than in the original experiment may be due to the fact that the same number (10) of $1 bonuses was offered to twice the number of crowd-workers.

4 Conclusions

In this paper we investigated the efficacy of providing real-time performance feedback to crowd-workers who are annotating document-query pairs with relevance judgements. The main findings of the investigation were:

1. Solely providing basic feedback to crowd-workers in the form of real-time performance information or a leaderboard has little effect on their relevance judging accuracy.

2. Providing a monetary incentive appears necessary in order to motivate crowd-workers to compete to achieve better annotation performance. Moreover, the provision of a leaderboard appears necessary in order to get the maximum effect of the payment of bonuses to the best performing relevance judges.15

3. Including test questions improves overall rating accuracy, and does not adversely affect the usefulness of a leaderboard.

There are a number of interesting directions for future work:

• We intend to repeat the experiments for the real-life use-case where the ground-truth relevance judgements are not known, and the accuracy of crowd-workers must be estimated based on the relevance judgements collected across workers up to that point. In this case, EM-based algorithms such as Dawid-Skene [DS79] can be used to estimate the quality of each crowd-worker (as well as the posterior over relevance for each document); a simplified sketch of this idea is given after this list. The estimated accuracy values will likely exhibit greater variability, and it will be interesting to see whether that increased variability has an effect on crowd-workers' actual accuracy.

• We intend to investigate other performance measures, such as the amount of time taken to provide relevance judgements and the number of relevance judgements per worker, to see whether real-time feedback and competition cause crowd-workers to annotate more items. We would also like to deepen the analysis of the collected data to investigate whether the accuracy of workers improves over time and what effect competition has on workers' performance over time.

15 We note that the second finding was not significant at the 0.05 level, but the controlled manner in which we performed the tests, and the consistency of the improvement across experiments, indicate that larger repeat experiments would likely discover significant results.

• Finally, there is an enticing opportunity to provide real-time rewards to crowd-workers based on their performance, and possibly even to link the amount of bonus paid to the estimated accuracy of the crowd-worker, thereby leading to, in some sense, "economically optimal" (from a decision-theoretic point of view) crowd-sourcing approaches.
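As a pointer for the first direction above, here is a deliberately simplified sketch of the EM idea, in the spirit of Dawid-Skene [DS79] but with a single accuracy parameter per worker instead of a full confusion matrix; the uniform class prior, the initialisation from vote fractions, and all names are our assumptions.

```python
# Simplified "one-coin" EM sketch, not the full Dawid-Skene algorithm.
from collections import defaultdict

def one_coin_em(labels, classes, n_iter=50):
    """labels: list of (worker, item, label); returns (item posteriors, worker accuracies)."""
    K = len(classes)
    by_item = defaultdict(list)
    for worker, item, label in labels:
        by_item[item].append((worker, label))
    # Initialise each item's posterior from the empirical vote fractions.
    post = {item: {c: sum(lab == c for _, lab in votes) / len(votes) for c in classes}
            for item, votes in by_item.items()}
    acc = {}
    for _ in range(n_iter):
        # M-step: a worker's accuracy is the expected fraction of correct labels.
        hits, totals = defaultdict(float), defaultdict(int)
        for worker, item, label in labels:
            hits[worker] += post[item][label]
            totals[worker] += 1
        acc = {w: hits[w] / totals[w] for w in totals}
        # E-step: recompute item posteriors given the current accuracies.
        for item, votes in by_item.items():
            scores = {}
            for c in classes:
                p = 1.0 / K                                  # uniform class prior
                for worker, label in votes:
                    p *= acc[worker] if label == c else (1 - acc[worker]) / (K - 1)
                scores[c] = p
            norm = sum(scores.values()) or 1.0               # guard against underflow to zero
            post[item] = {c: scores[c] / norm for c in scores}
    return post, acc
```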

Acknowledgments

The authors thank Wray Buntine for useful discussions regarding this work and Chunlei Chang for programming assistance. Mark Carman acknowledges research funding through a Collaborative Research Project award from National ICT Australia.

References

[ABY11] Omar Alonso and Ricardo Baeza-Yates. Design and implementation of relevance assessments using crowdsourcing. In Advances in Information Retrieval, pages 153–164. Springer, 2011.

[AM09] Omar Alonso and Stefano Mizzaro. Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment. In Proceedings of the SIGIR 2009 Workshop on the Future of IR Evaluation, pages 15–16, 2009.

[DS79] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):20–28, 1979.

[EHdVS12] Carsten Eickhoff, Christopher G. Harris, Arjen P. de Vries, and Padmini Srinivasan. Quality through flow and immersion: Gamifying crowdsourced relevance assessments. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 871–880. ACM, 2012.

[GL10] Catherine Grady and Matthew Lease. Crowdsourcing document relevance assessment with Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, CSLDAMT '10, pages 172–179, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[Har14] Christopher G. Harris. The beauty contest revisited: Measuring consensus rankings of relevance using a game. In Proceedings of the First International Workshop on Gamification for Information Retrieval, pages 17–21. ACM, 2014.

[KLSH09] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery, 18(1):140–181, 2009.

[MN15] Cristina Ioana Muntean and Franco Maria Nardini. Gamification in information retrieval: State of the art, challenges and opportunities. In Proceedings of the 6th Italian Information Retrieval Workshop, IIR 2015, 2015.

[MPA15] Pavel Metrikov, Virgil Pavlu, and Javed A. Aslam. Aggregation of crowdsourced ordinal assessments and integration with learning to rank: A latent trait model. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM '15, pages 1391–1400, New York, NY, USA, 2015. ACM.

[MWA+13] Pavel Metrikov, Jie Wu, Jesse Anderton, Virgil Pavlu, and Javed A. Aslam. A modification of LambdaMART to handle noisy crowdsourced assessments. In Proceedings of the 2013 Conference on the Theory of Information Retrieval, ICTIR '13, pages 31:133–31:134, New York, NY, USA, 2013. ACM.

[SKL12] Mark D. Smucker, Gabriella Kazai, and Matthew Lease. Overview of the TREC 2012 crowdsourcing track. Technical report, DTIC Document, 2012.