DEPARTMENT OF INFORMATICS
UNIVERSITY OF FRIBOURG (SWITZERLAND)
Quality of Service in
Crowd-Powered Systems
THESIS
Presented to the Faculty of Science of the University of Fribourg (Switzerland)
in consideration for the award of the academic grade of
Doctor scientiarum informaticarum
by
DJELLEL EDDINE DIFALLAH
from
ALGIERS, ALGERIA
Thesis No. 1912
UniPrint
2015
Weak human + machine + superior process was greater
than a strong computer and, remarkably, greater than
a strong human + machine with an inferior process.
— Garry Kasparov
Acknowledgements

First, I would like to extend my deepest gratitude to my advisor, Philippe Cudré-Mauroux, who
gave me the opportunity to work with him and provided me all the necessary ingredients to
grow as a researcher. His wisdom, generosity and kindness will always be an inspiration to me.
I am especially grateful to Gianluca Demartini, who instilled in me his passion for the topic,
and with whom I had exhilarating discussions on how to shape the future of crowdsourcing. I
would also like to thank the rest of my thesis committee, Panos Ipeirotis, Béat Hirsbrunner,
and the jury president Ulrich Ultes-Nitsche, for their availability, insights and questions, which
helped me shape the final form of this thesis.
I have been extremely fortunate to be surrounded by brilliant colleagues and friends at the
eXascale Infolab: Marcin Wylot, Jean-Gérard Pont, Roman Prokofyev, Alberto Tonon, Martin
Grund, Victor Felder, Michael Luggen, Ruslan Mavlyutov, Artem Lutov, Mourad Khayati,
Dingqi Yang and not forgetting our friends Michele Catasta from EPFL and Monica Noselli. I
am truly appreciative of all the stimulating discussions, critiques, and motivation you provided
during my PhD.
I am also thankful for the time, support and encouragement of Carlo Curino who gave me the
opportunity to spend three months at Microsoft CISL. I would also like to thank the rest of the
CISL team: Chris Douglas, Russell Sears, Sriram Rao and Raghu Ramakrishnan, for their help,
availability and mentorship.
In addition, I would like to extend a warm thank you to all the extraordinary researchers I had
the chance to collaborate with and visit on multiple occasions for numerous projects; I would
be remiss not to thank Andy Pavlo, Eugene Wu and Sean McKenna in particular.
Finally, this work would not have been possible without my family and friends who always
encouraged me to pursue my dreams, and provided unconditional support throughout the
years. For their love, support and patience, I will be eternally thankful.
Fribourg, September 2015 Djellel Eddine Difallah
Abstract

Human-machine computation opens the door to a new class of applications that combine
the scalability of computers with the as-yet-unmatched cognitive abilities of the human brain.
Such a synergy is now possible thanks to the advent of programmable micro-task
crowdsourcing platforms that facilitate the recruitment and compensation of online users.
Today, crowdsourcing is leveraged in many fields including data management, information
retrieval and machine learning. For example, a crowd-powered data management system
makes it possible to process new types of tasks including subjective sorting, semantic joins, or
complex data integration. While the use of human-machine computation fills a significant
gap in intelligent data processing, it often raises concerns about the overall Quality of Service
(QoS) guarantees that such hybrid systems can offer to end users, in terms of the efficiency
and effectiveness of the collected results.
In this thesis, we investigate, design, and evaluate several techniques and algorithms that
improve the efficiency and the effectiveness of crowd-powered systems. We tackle the following
crowdsourcing-specific QoS aspects: quality of responses, progress of batches of tasks,
and load-balancing of heterogeneous tasks among crowd workers. In order to improve those
aspects, we explore techniques stemming from expert finding, human resources, and cluster
management practices, deriving solutions that take into account inherent human-machine
differences, e.g., unpredictability, preferences, and poor context switching. Specifically, we
make the following contributions: (1) We reduce the error on multiple-choice tasks by
continuously assessing the quality of the workers using probabilistic inference. Our model uses
signals from test tasks, peer consensus, and confidence scores obtained from machine-based
solvers. (2) We propose a task assignment model (push-crowdsourcing) that matches tasks
with potentially better-suited users. For that purpose, we index the workers’ profiles based on
their provided social network information. (3) We avoid the stagnation of a crowdsourcing
campaign by providing monetary incentives favoring worker retention. (4) We load balance
tasks across multiple workers to improve the efficiency of a multi-tenant system. Here, we
adopt cluster scheduling approaches for their scalability and adapt them to reduce context
switching for the workers.
We experimentally show that, by using such approaches, one can improve the quality of
the answers provided by the crowd, boost the speed of crowdsourcing campaigns, and load-
balance the crowd workforce across heterogeneous tasks.
Keywords: Crowdsourcing, Crowd-Powered Systems, Human Computation, Quality of Service.
Zusammenfassung

Human-machine computation opens up a new kind of application that combines the
scalability of computers with the as-yet-unmatched cognitive abilities of the human brain.
This synergy is possible today thanks to new programmable micro-task crowdsourcing
platforms, which make it possible to recruit and compensate workers on the Internet.
Today, crowdsourcing is used in many different fields, such as data management,
information retrieval, and machine learning. For example, a data management system
supported by the crowd can execute novel operations such as subjective sorting, semantic
joins, or complex data integration. While involving humans to enable intelligent data
processing fills a significant gap, it comes at the expense of the Quality of Service (QoS)
guarantees that such a hybrid system can offer to end users, particularly with respect to
the efficiency and effectiveness of the collected answers.
In this thesis, we investigate and develop various techniques and algorithms that improve
the efficiency and effectiveness of crowd-powered systems. We address the following
crowdsourcing-specific QoS aspects: task error rates, the quality of answers, the progress
of batches of tasks, and the distribution of heterogeneous tasks among crowd workers. We
tackle these topics using techniques from the fields of expert finding, human resources,
and cluster management practices, and our derived solutions account for inherent
human-machine differences such as unpredictability, preferences, and poor context
switching. We demonstrate progress on the following points: 1.) We reduce the error rates
of multiple-choice tasks by continuously assessing worker quality with probabilistic
inference. Our model uses data from test tasks, from peer consensus, and from confidence
scores produced by machine-based solvers. 2.) We present a task assignment model
(push-crowdsourcing) that matches tasks with potentially better-suited users; to this end,
we draw on the profiles the workers provide on social networks. 3.) We prevent the
stagnation of a crowdsourcing campaign through monetary incentives that improve worker
retention. 4.) We load-balance tasks across multiple workers to improve the efficiency of a
multi-tenant system, adopting cluster scheduling approaches for their scalability and
adapting them to minimize context switching for the workers.
Our experiments show that, by using these approaches, we can improve the quality of the
answers provided by the crowd, shorten the execution of crowdsourcing campaigns, and
distribute the workforce across heterogeneous tasks.
Keywords: Crowdsourcing, Crowd-Powered Systems, Human Computation, Quality of Service.
Résumé

Combining the unmatched cognitive abilities of the human brain with the computational
power of machines opens the way to a new type of intelligent system known as a hybrid
human-machine system. Such a synergy is now possible and accessible thanks to the advent
of crowdsourcing platforms, which facilitate the recruitment and compensation of transient
online workers. Nowadays, crowdsourcing is used in various domains, including data
management, information retrieval, and machine learning. For example, a hybrid
human-machine data management system makes it possible to process new types of
queries, including subjective sorting, semantic joins, or the integration of the most complex
data sources. While the use of crowdsourcing fills an important need in intelligent data
processing, the practice raises concerns about the Quality of Service (QoS) that one can
guarantee to its users.
In this thesis, we study and design several techniques and algorithms that aim to improve
the efficiency and effectiveness of hybrid human-machine systems. We target the following
crowdsourcing-specific QoS aspects: the error rate, the quality of answers, the progress of
batches of tasks, and the balancing of the workload among users. To address these
questions, we explore techniques ranging from expert finding to human-resources practices
and cluster management; our solutions thus take into account the inherent differences
between humans and machines, for example unpredictability, preferences, and context
switching. More precisely, our contributions are as follows: (1) We reduce the error rate on
multiple-choice tasks by continuously evaluating worker quality using statistical models.
(2) We propose a task distribution model (push-crowdsourcing) that matches a task with
the potentially best-suited worker; to do so, we index worker profiles based on information
extracted from social networks. (3) We avoid the stagnation of a crowdsourcing campaign
by providing bonuses that favor worker retention. (4) We distribute tasks across multiple
workers to improve the efficiency of a multi-tenant system; in this case, we reuse cluster
management approaches and adapt them to reduce context switching for each worker.
In our experiments, we show that by using such methods we can improve the quality of the
answers provided by anonymous workers, increase the speed of crowdsourcing campaigns,
and balance the load of heterogeneous tasks.
Keywords: Crowdsourcing, Hybrid Human-Machine Systems, Quality of Service.
Contents

Acknowledgements v
Abstract (English/Deutsch/Français) vii
List of figures xvii
List of tables xxi
1 Introduction 1
1.1 Crowdsourcing and Human Computation . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Human Computation and Micro-tasks . . . . . . . . . . . . . . . . . . . . 2
1.1.2 The Amazon Mechanical Turk Marketplace . . . . . . . . . . . . . . . . . 3
1.2 Crowd-powered Algorithms and Systems . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Quality of Service in Crowd-powered Systems . . . . . . . . . . . . . . . . . . . . 4
1.4 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.1 Additional Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 What this Thesis is Not About . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Background in Crowd-Powered Systems 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Crowd-Powered Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Crowd-Powered Database Systems . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Crowd-Powered Database Operators . . . . . . . . . . . . . . . . . . . . . 11
2.2.3 Crowd-Powered Systems in Other Communities . . . . . . . . . . . . . . . 12
2.2.4 Languages and Toolkits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Task Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Task Repetitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Test Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.3 Result Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.4 Task Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Task Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 An Analysis of the Amazon Mechanical Turk Crowdsourcing Marketplace 19
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 The Evolution of Amazon MTurk From 2009 to 2014 . . . . . . . . . . . . . . . . 21
3.3.1 Crowdsourcing Platform Dataset . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.2 A Data-driven Analysis of Platform Evolution . . . . . . . . . . . . . . . . 21
3.4 Large-Scale HIT Type Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4.1 Supervised HIT Type Classification . . . . . . . . . . . . . . . . . . . . . . 26
3.4.2 Task Type Popularity Over Time . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5 Analyzing the Features Affecting Batch Throughput . . . . . . . . . . . . . . . . . 28
3.5.1 Machine Learning Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5.2 Throughput Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5.3 Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Market Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.6.1 Supply Attracts New Workers . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.6.2 Demand and Supply Periodicity . . . . . . . . . . . . . . . . . . . . . . . . 32
3.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4 Human Intelligence Task Quality Assurance 37
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.1 The Entity Linking and Instance Matching Use-Cases . . . . . . . . . . . 38
4.2 Preliminaries on the EL and IM Tasks . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 ZenCrowd Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.2 LOD Index and Graph Database . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3.3 Probabilistic Graph & Decision Engine . . . . . . . . . . . . . . . . . . . . 44
4.3.4 Extractors, Algorithmic Linkers & Algorithmic Matchers . . . . . . . . . . 44
4.3.5 Three-Stage Blocking for Crowdsourcing Optimization . . . . . . . . . . . 45
4.3.6 Micro-Task Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 Effective Instance Matching based on Confidence Estimation and Crowdsourcing 46
4.4.1 Instance-Based Schema Matching . . . . . . . . . . . . . . . . . . . . . . . 47
4.4.2 Instance Matching with the Crowd . . . . . . . . . . . . . . . . . . . . . . . 48
4.5 Probabilistic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5.1 Graph Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.5.2 Reaching a Decision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.5.3 Updating the Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.5.4 Selective Model Instantiation . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.6 Experiments on Instance Matching . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.6.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.6.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.7 Experiments on Entity Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.7.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.7.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.7.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.8 Related Work on Entity Linking and Instance Matching . . . . . . . . . . . . . . . 68
4.8.1 Instance Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.8.2 Entity Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5 Human Intelligence Task Routing 71
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2.2 HIT Generation, Difficulty Assessment, and Reward Estimation . . . . . 73
5.2.3 Crowd Profiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.4 Worker Profile Linker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.5 Worker Profile Selector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.6 HIT Assigner and Facebook App . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.7 HIT Result Collector and Aggregator . . . . . . . . . . . . . . . . . . . . . . 77
5.3 HIT Assignment Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.1 Category-based Assignment Model . . . . . . . . . . . . . . . . . . . . . . 77
5.3.2 Expert Profiling Assignment Model . . . . . . . . . . . . . . . . . . . . . . 78
5.3.3 Semantic-Based Assignment Model . . . . . . . . . . . . . . . . . . . . . . 79
5.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4.2 Motivation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4.3 SocialBrain{r} Crowd Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.4 Evaluation of HIT Assignment Models . . . . . . . . . . . . . . . . . . . . . 83
5.4.5 Comparison of HIT Assignment Models . . . . . . . . . . . . . . . . . . . . 84
5.5 Related Work in Task Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.5.1 Crowdsourcing over Social Networks . . . . . . . . . . . . . . . . . . . . . 85
5.5.2 Task Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.5.3 Expert Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6 Human Intelligence Task Retention 89
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Worker Retention Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2.2 Pricing Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2.3 Visual Reward Clues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2.4 Pricing Schemes for Different Task Types . . . . . . . . . . . . . . . . . . . 93
6.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.3.3 Efficiency Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.5 Related Work on Worker Retention and Incentives . . . . . . . . . . . . . . . . . 99
6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7 Human Intelligence Task Scheduling 103
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.1.1 Motivating Use-Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.1.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.2 Scheduling on Amazon MTurk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.2.1 Execution Patterns on Micro-Task Crowdsourcing Platforms . . . . . . . 106
7.2.2 A Crowd-Powered DBMS Scheduling Layer on top of AMT . . . . . . . . . 107
7.3 HIT Scheduling Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.3.1 HIT Scheduling: Problem Definition . . . . . . . . . . . . . . . . . . . . . . 109
7.3.2 HIT Scheduling Requirement Analysis . . . . . . . . . . . . . . . . . . . . 109
7.3.3 Basic Space-Sharing Schedulers . . . . . . . . . . . . . . . . . . . . . . . . 111
7.3.4 Fair Schedulers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.3.5 Gang Scheduling for Collaborative HITs . . . . . . . . . . . . . . . . . . . . 113
7.3.6 Crowd-aware Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.4.2 Micro Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4.3 Scheduling HITs for the Crowd . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.4.4 Live Deployment Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.5 Related Work on Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
8 Conclusions 127
8.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8.1.1 Toward Crowdsourcing Platforms with an Integrated CrowdManager . . . 128
8.1.2 Worker Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8.1.3 HIT Recommender System . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
8.1.4 Crowd-Powered Big Data Systems . . . . . . . . . . . . . . . . . . . . . . . 130
8.1.5 Social and Mobile Crowdsourcing . . . . . . . . . . . . . . . . . . . . . . . 130
8.2 Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
List of Figures

1.1 Examples of Human Computation applications combining the efficiency of
machines with the effectiveness of humans. . . . . . . . . . . . . . . . . . . . . . 2
1.2 The MTurk worker main interface. . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 The CrowdManager interface with the four components that we propose in
this thesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 A Human Intelligence Task mockup. . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 Batch throughput versus number of HITs available in the batch. The red line
corresponds to the maximum throughput we could have observed due to the
tracker periodicity constraints. For readability, this graph represents a subset of
3 months (January-March 2014), and HITs with rewards of $0.05 or less. . . . . 22
3.2 The use of keywords to annotate HITs. Frequency corresponds to how many
times a keyword was used, and AverageReward corresponds to the average
monetary reward of batches that listed the keyword. The size of the bubbles
indicates the average batch size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 HITs with specific country requirements. On the left-hand side, the countries
with the most HITs dedicated to them. On the right-hand side, the time evolution
(x-axis) of country-specific HITs with volume (y-axis) and reward (size of data
point) information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Keywords for HITs restricted to specific countries. . . . . . . . . . . . . . . . . . . 24
3.5 Popularity of HIT reward values over time. . . . . . . . . . . . . . . . . . . . . . . 25
3.6 Requester activity and total reward on the platform over time. . . . . . . . . . . 25
3.7 The distribution of batch sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.8 Average and maximum batch size per month. The monthly median is 1. . . . . 27
3.9 Popularity of HIT types over time. . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.10 Predicted vs actual batch throughput values for δ = 4 hours. The prediction
works best for larger batches having a large momentum. . . . . . . . . . . . . . . 30
3.11 Computed feature importance when considering a larger training window for
batch throughput prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.12 The effect of newly arrived HITs on the work supplied. Here, the supply is ex-
pressed as the percentage of HITs completed in the market. . . . . . . . . . . . . 33
3.13 Computed autocorrelation on the number of HITs available and on the weekly
moving average of the completed reward (N.B., autocorrelation’s Lag is computed
in Hours). In both cases, we clearly see a weekly periodicity (0-250 Hours). . . . 34
4.1 The architecture of ZenCrowd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 The Label-only instance matching HIT interface, where entities are displayed as
textual labels linking to the full entity descriptions in the LOD cloud. . . . . . . 49
4.3 The Molecule instance matching HIT interface, where the labels of the entities
as well as related property-value pairs are displayed. . . . . . . . . . . . . . . . . 50
4.4 The Entity Linking HIT interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5 An entity factor-graph connecting two workers (w_i), six clicks (c_ij), and three
candidate matchings (m_j). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.6 Maximum achievable Recall by considering top-K results from the inverted
index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.7 Precision and Recall as compared to Matching confidence values. . . . . . . . . 57
4.8 Number of tasks generated for a given confidence value. . . . . . . . . . . . . . . 58
4.9 ZenCrowd money saving by considering results from top-K workers only. . . . . 59
4.10 Distribution of the workers’ precision using the Molecule design as compared to
the number of tasks performed by the workers. . . . . . . . . . . . . . . . . . . . 60
4.11 Average Recall of candidate selection when discriminating on max relevance
probability in the candidate URI set. . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.12 Performance results (Precision, Recall) for the automatic approach. . . . . . . . 64
4.13 Per document task effectiveness. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.14 Crowdsourcing results with two different textual contexts. . . . . . . . . . . . . . 65
4.15 Comparison of three linking techniques. . . . . . . . . . . . . . . . . . . . . . . . 66
4.16 Distribution of the workers’ Precision for the Entity Linking task as compared to
the number of tasks performed by the worker (top) and task Precision with top k
workers (bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.17 Number of HITs completed by each worker for both IM and EL ordered by most
productive workers first. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1 Pick-A-Crowd Component Architecture. Task descriptions, Input Data, and a
Monetary Budget are taken as input by the system, which creates HITs, estimates
their difficulty and suggests a fair reward based on the skills of the crowd. HITs
are then pushed to selected workers and results get collected, aggregated, and
finally returned back to the requester. . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Screenshots of the SocialBrain{r} Facebook App. Above, the dashboard displaying
HITs available to a specific worker. Below, a HIT about actor identification
assigned to a worker who likes several actors. . . . . . . . . . . . . . . . . . . . . 76
5.3 An example of the Expert Finding Voting Model. . . . . . . . . . . . . . . . . . . . 78
5.4 Crowd performance on the cricket task. Square points indicate the 5 workers
selected by our graph-based model that exploits entity type information. . . . . 81
5.5 Crowd performance on the movie scene recognition task as compared to movie
popularity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.6 SocialBrain{r} Crowd age distribution. . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.7 SocialBrain{r} Notification click rate. . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.8 SocialBrain{r} Crowd Accuracy as compared to the number of relevant Pages a
worker likes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.1 Work in crowdsourced tasks classically follows a long-tail distribution where a
few workers complete most of the work while many workers complete just one
or two HITs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2 Screenshot of the Bonus Bar used to show workers their current and total reward. 93
6.3 Screenshot of the Bonus Bar with next milestone and bonus. . . . . . . . . . . . 93
6.4 Effect of different bonus pricing schemes on worker retention over three different
HIT types. Workers are ordered by the number of completed HITs. . . . . . . . . 94
6.5 Average HIT execution time with standard error, ordered by sequence in the
batch. Results are grouped by worker category (long-, medium- and short-term
workers). In many cases, the long-term workers improve their HIT execution
time. This is expected to have a positive impact on the overall batch latency. 96
6.6 Overall precision per worker and category of worker for the Butterfly Classifica-
tion task (using Increasing Bonus). . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.7 Results of five independent runs of A, B and C setups. Type A batches include the
retention focused incentive while Type B is the standard approach using fixed
pricing, Batch C uses a higher fixed pricing – but leveraging the whole bonus
budget. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.1 Caption for LOF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.2 The role of the HIT Scheduler in a Multi-Tenant Crowd-Powered System Archi-
tecture (e.g., a DBMS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.3 Results of a crowdsourcing experiment involving 100+ workers concurrently
working in a controlled setting on a HIT-BUNDLE containing heterogeneous
HITs (B1-B5, see section 7.4) scheduled with FS. (a) Throughput (measured in
HITs/minute) increases with an increasing number of workers involved. (b)
Amount of work done by each worker. . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.4 A performance comparison of batch execution time using different grouping
strategies publishing a large batch of 600 HITs vs smaller batches (From B6). . . 117
7.5 A performance comparison of batch execution time using different grouping
strategies publishing two distinct batches of 192 HITs separately vs combined
inside an HIT-BUNDLE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.6 Average Execution time for each HIT submitted from the experimental groups
RR, SEQ10 and SEQ25. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.7 Scheduling approaches applied to the crowd. . . . . . . . . . . . . . . . . . . . . 120
7.8 (a) Effect of increasing B2 priority on batch execution time. (b) Effect of varying
the number of crowd workers involved in the completion of the HIT batches. . 121
7.9 An example of a successful scheduling of a collaborative task involving 3 workers
within a window of 10 seconds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.10 Accuracy and precision of gang scheduling methods. . . . . . . . . . . . . . . . . 122
7.11 Average execution time per HIT under different scheduling schemes. . . . . . . 123
7.12 CDF of different batch sizes and scheduling schemes. . . . . . . . . . . . . . . . 124
7.13 Worker allocation with FS, WCFS and classical individual batches in a live de-
ployment of a large workload derived from crowdsourcing platform logs. Each
color represents a different batch. . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
8.1 The concept of the Flow Theory [42]. . . . . . . . . . . . . . . . . . . . . . . . . . 129
List of Tables
3.1 Gini importance of the top 2 features used in the prediction experiment. A large
mean indicates a better overall contribution to the prediction. A positive slope
indicates that the feature is gaining in importance when the considered time
window is larger. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 Top ranked schema element pairs in DBPedia and Freebase for the Person,
Location, and Organization instances. . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Crowd Matching Precision over two different HIT design interfaces (Label-only
and Molecule) and two different aggregation methods (Majority Vote and Zen-
Crowd). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Matching Precision for purely automatic and hybrid human/machine approaches. 57
4.4 Correct and incorrect matchings as by crowd Majority Voting using two different
HIT designs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5 Performance results for the candidate selection approach. . . . . . . . . . . . . . 62
4.6 Performance results for crowdsourcing with majority vote over linkable entities. 64
4.7 Performance results for crowdsourcing with ZenCrowd over linkable entities. . 65
5.1 A comparison of the task accuracy for the AMT HIT assignment model assigning
each HIT to the first 3 and 5 workers and to AMT Masters. . . . . . . . . . . . . . . 83
5.2 A comparison of the effectiveness for the category-based HIT assignment models
assigning each HIT to 3 and 5 workers with manually selected categories. . . . . 84
5.3 Effectiveness for different HIT assignments based on the Voting Model assigning
each HIT to 3 and 5 workers and querying the Facebook Page index with the task
description q = ti and with candidate answers q = Ai respectively. . . . . . . . . 84
5.4 Effectiveness for different HIT assignments based on the entity graph in the
DBPedia knowledge base assigning each HIT to 3 and 5 workers. . . . . . . . . . 85
5.5 Average Accuracy for different HIT assignment models assigning each HIT to 3
and 5 workers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.1 Statistics for the three different HIT types. . . . . . . . . . . . . . . . . . . . . . . 94
6.2 Statistics of the second experimental setting – English Essay Correction . . . . . 98
7.1 Description of the batches constituting the dataset used in our experiments. . 116
1 Introduction
Big data is revolutionizing the way businesses operate by supporting decision processes
thanks to massive data gathering and advanced data analysis. Beyond some of the identified
properties of big data (volume, velocity, variety) [149], the complexity of certain classes of
content, and the ad-hoc nature of some analytical requests, pose further challenges not yet
solved by fully-automated algorithms. Indeed, unlocking the real potential of data often
resides in processing complex pieces of information that only humans can fully comprehend.
To that end, some companies outsource or hire full-time employees to perform tasks such as
data entry, data pre-processing and data integration. However, this approach quickly shows
its limitations as the volume of data to be processed increases and turnaround times become
critical.
Crowdsourcing has emerged as an alternative to outsourcing. It is defined as the act of creating
an open call to perform a job that anyone on the Internet can do [74]. In order to scale, such a
job is usually broken into micro-tasks that the crowd can perform in parallel, hence producing
faster results. More complex crowdsourcing scenarios can be put in place, e.g., job pipelines,
where the output of a task is used as input to another one.
The major counter-argument to the use of crowdsourcing is its inefficiency. In fact, asking
the crowd to process all the records of a large database is not only costly (although the result
can be of a higher value), it is also inherently bound by the crowd size and the speed of the
workers. This makes crowdsourcing impractical for cases where high data velocity and billions
of records are the norms. Moreover, crowd workers usually exhibit high error rates, which can
be rooted in multiple factors like fatigue, subjectivity, priming and even willingness to cheat.
In order to leverage crowdsourcing as a viable solution to complex data processing needs,
and potentially create an added value for the end-users (often called requesters), it is worth
investigating methods to integrate the effectiveness of crowdsourcing with the efficiency of
machines, maintaining a good user experience in terms of latency while minimizing the overall cost
and maximizing the quality of crowdsourcing results.
[Figure content: example Human Computation applications – Protein Folding (FoldIT), Image Tagging (ESP Game), Data Management (CrowdDB) – where computer-generated Human Tasks are sent to crowd workers via crowdsourcing platforms, which return Human Answers.]
Figure 1.1 – Examples of Human Computation applications combining the efficiency of machines with the effectiveness of humans.
1.1 Crowdsourcing and Human Computation
First coined by Jeff Howe in his article “The Rise of Crowdsourcing” [74], the term crowdsourcing
is nowadays used to describe several types of activities that involve the crowd with different
incentives and expectations. Crowdfunding, for instance, consists in raising money from the
crowd to support a project [99]. Another example is Citizen Journalism, where the crowd
contributes pieces of information like reports, photos, and videos to create novel channels for
news gathering [11].
1.1.1 Human Computation and Micro-tasks
In our context, we refer to crowdsourcing as the general paradigm that leverages human
abilities to solve problems that a computer is not yet capable of solving with acceptable
precision (if at all); we commonly call this concept Human Computation (HC) [157]. In order
to tap into the power of Human Computation at scale, one needs to offer proper incentives to
the crowd, e.g., monetary reward, fun, altruism or social recognition. Figure 1.1 illustrates the
basic interaction between a backend system and a crowdsourcing platform.
In this thesis, we are interested in paid micro-task crowdsourcing, where the crowd is asked
to perform short tasks, also known as Human Intelligence Tasks (HITs), in exchange for a
small monetary reward per unit. Popular examples of such tasks include: spell checking of
short paragraphs, sentiment analysis of tweets, rewriting product reviews, or transcription of
scanned shopping receipts.
Micro-task crowdsourcing has gained momentum with the emergence of online labor market-
places which facilitate the interaction between requesters and potential workers. A typical
crowdsourcing platform would work as follows: First the requesters design the HIT interface
based on their input data and desired outcome. Next, they publish the HIT onto the
crowdsourcing platform specifying the promised monetary reward in exchange for the completion
of each HIT. Finally, those workers willing to perform the published HITs complete the tasks and
submit their results back to the requester, who obtains the desired output and compensates
the workers accordingly. There are many popular platforms that offer such services including
Amazon Mechanical Turk (AMT) [1], ClickWorker [2], CloudFactory [3], and CrowdFlower [4].
1.1.2 The Amazon Mechanical Turk Marketplace
Most of the experiments conducted in this thesis were done on Amazon Mechanical Turk. AMT
is the oldest and most popular micro-task crowdsourcing platform; it has a continuous flow
of workers and requesters. AMT provides programmatic Application Programming Interfaces
(APIs) as well as a Web interface for requesters to design and deploy online tasks, and its
activity logs are available to the public [76] and were used in the context of this thesis to
perform an analysis tracing its evolution (see Chapter 3).
AMT adopts a pull methodology, where all the published tasks are publicly presented on a
search-based dashboard (see Figure 1.2). The workers can pick their preferred tasks on a
first-come-first-served basis.
From a requester perspective, the pull crowdsourcing approach has several advantages includ-
ing simplicity and minimization of task completion times, since any available worker from
the crowd can pick and perform any HIT, provided that they meet some pre-requisites set
by the requester. From a worker perspective, it creates competition among requesters, and
potentially leads to high HIT standards in terms of interface design, quality, and pricing.
On the other hand, pull crowdsourcing limits the possibilities of the platform to offer any
form of service guarantees to its customers (i.e., the requesters). For example, this mechanism
cannot guarantee priority to a requester who has a deadline, and often the only effective
lever consists in increasing the unit reward of the HITs to attract more workers [12]. It also
cannot guarantee that the worker who performs the task is the best fit, as more knowledgeable
workers might be available within the crowd, but are unable to pick the task on time.
1.2 Crowd-powered Algorithms and Systems
Modern crowdsourcing platforms offer programmatic APIs in order to post HITs, monitor their
progress, collect the results and distribute the rewards. Hence, the idea of combining human
computation and computers to produce a new breed of hybrid Human-machine algorithms
found an opportunity to materialize. Not only can the crowd be invoked programmatically,
e.g., using a declarative language, but this very process can also be parametrized, monitored,
and embedded in long-running jobs.
A direct application of this idea goes naturally with the class of machine-learning algorithms
that produce their results along with a confidence score. A generic hybrid scheme consists in
falling back to the crowd to increase the precision of the results whenever the confidence of
the generated solution falls below a predefined threshold.
Figure 1.2 – The MTurk worker main interface.
Another application is in active
learning, where a classification algorithm would repeatedly collect training labels from the
crowd – as opposed to a limited number of human operators [123]. Similarly, we refer to the
class of computer systems that involve the crowd at some point in their execution
as Crowd-Powered Systems. A canonical example is CrowdDB [64], a relational database
management system with an augmented SQL syntax whose queries trigger HITs on
AMT, asking the crowd to perform a predefined data processing operation.
1.3 Quality of Service in Crowd-powered Systems
Humans and machines behave fundamentally differently: While machines can deal with
large volumes of data, with real-time streams, and with flocks of concurrent users interacting
with the system, crowdsourcing is currently seen mostly as a batch-oriented, offline data
processing paradigm. Today, crowdsourcing platforms are not providing any guarantees on
task completion times due to the unpredictability of the crowd workers, who are free to come
and go at any moment, and to selectively focus on an arbitrary subset of the available tasks
only. Moreover, the quality of the provided answers can vary dramatically for the same worker
and across workers. These are the hazards that any crowd-powered system needs to deal with
automatically in order to provide better services to its end-users.
Quality of Service (QoS) is a concept that is mostly used in telephony and computer networks.
It refers to the measures taken to improve (and sometimes guarantee) the overall performance
perceived by the users in terms of throughput, error rate, latency, among other domain-specific
metrics. In this thesis, we specifically investigate effectiveness and efficiency as QoS aspects
Figure 1.3 – The CrowdManager interface with the four components that we propose in this thesis.
that need to be improved in a crowd-powered system. We define these two aspects and the
scope that we consider in Section 1.4.
We note that the Quality of Service that we are targeting in this thesis is best effort. This
limitation is due to several inherent reasons: 1) the crowd is not employed and thus not bound
by any contract, and 2) the size of the available workforce can vary widely throughout the day.
Under these conditions, we can potentially model and predict the execution time of a batch
of HITs [62], but not enforce a given promise to the requesters (e.g., in order to finish a task
before a given deadline).
1.4 Summary of Contributions
The goal of this thesis is to: “Investigate, design, and evaluate methods and algorithms that
improve the effectiveness and efficiency of crowd-powered systems”. In practice, we implement
several modules as part of a CrowdManager which, in essence, can be thought of as a smart
network interface that manages and improves the exchanges between the backend system
and the target crowdsourcing platform (see Figure 1.3).
In the following, we detail our contributions and list the associated conference and journal
papers that we have published along our research work. We tackle two major areas that pertain
to the Quality of Service in crowdsourcing, namely: Effectiveness and Efficiency.
A) Effectiveness designates the ability of a system to produce the desired results. Our focus
with that regard is to ensure the high quality of the collected results; in that context we
investigate the following approaches:
HIT Quality Assurance: We first tackle the issue of aggregating HIT responses obtained
from multiple crowd workers. We propose an aggregation mechanism to lower
error rates specifically for tasks with Multiple Choice Questions. Our approach
consists in using a probabilistic model and weighted voting when aggregating the
responses of multiple workers for a given task.
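For illustration, the weighted-voting idea can be sketched as follows; this is a simplified stand-in for the probabilistic model above, assuming per-worker reliability weights are already available (e.g., estimated from questions with known answers):

```python
from collections import defaultdict

def weighted_vote(answers, weights):
    """Aggregate multiple-choice answers using per-worker weights.

    answers: dict mapping worker id -> chosen option
    weights: dict mapping worker id -> reliability weight in [0, 1]
    Returns the option with the highest total weight.
    """
    scores = defaultdict(float)
    for worker, option in answers.items():
        scores[option] += weights.get(worker, 0.5)  # unknown workers get a neutral weight
    return max(scores, key=scores.get)

# Three workers answer a multiple-choice HIT; the reliable one disagrees.
answers = {"w1": "A", "w2": "A", "w3": "B"}
weights = {"w1": 0.3, "w2": 0.3, "w3": 0.9}
print(weighted_vote(answers, weights))  # "B": one high-weight vote outweighs two low-weight ones
```

With uniform weights this degenerates to plain majority voting, which is the baseline we compare against in Chapter 4.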
Our first work, ZenCrowd, focused on an entity linking use-case, where the ag-
gregation mechanism was used to enhance the results of an automatic entity
linker:
Demartini, Gianluca, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. "Zen-
Crowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-
scale entity linking." Proceedings of the 21st international conference on World Wide
Web. ACM, 2012.
In a follow-up work, we extend the use-case to cover instance matching tasks:
Demartini, Gianluca, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. "Large-
scale linked data integration using probabilistic reasoning and crowdsourcing." The
VLDB Journal 22.5 (2013): 665-687.
HIT Routing: Next, we explore an alternative approach to pull crowdsourcing that
actively selects and pushes tasks to specific crowd workers who might be able to
provide better answers to particular tasks. Although HITs usually do not require
any specific expertise, we can nevertheless leverage the general knowledge of the workers to
find a match. For example, one can assign a novel translation task to a worker who
likes novels. Then, we apply expert finding techniques to match HITs to our crowd
participants based on indexed profiles that we build from their social network
information. Our experimental system, Pick-A-Crowd, is a custom Facebook [5]
application that assigns tasks to its users automatically based on what they liked
on the social network.
Difallah, Djellel Eddine, Gianluca Demartini, and Philippe Cudré-Mauroux. "Pick-
a-crowd: tell me what you like, and i’ll tell you what to do." Proceedings of the 22nd
international conference on World Wide Web. ACM, 2013.
B) Efficiency designates the ability of a system to make the best use of the time, effort, and
budget in carrying out the task at hand. In this thesis, we are interested in reducing the
latency and in enabling HIT prioritization in batches of homogeneous or heterogeneous
HITs. We focus specifically on:
HIT Retention: Batches of tasks published on a crowdsourcing platform might be sub-
ject to slow progress and even to stagnation, especially when only a few tasks are
left. We investigate worker retention as a new dimension for increasing crowdsourcing
throughput and avoiding stagnation. In our work, we achieve worker retention
by granting punctual bonuses to the active workers.
Difallah, Djellel Eddine, Michele Catasta, Gianluca Demartini, and Philippe Cudré-
Mauroux. "Scaling-Up the Crowd: Micro-Task Pricing Schemes for Worker Retention
and Latency Improvement." Second AAAI Conference on Human Computation and
Crowdsourcing. 2014.
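A punctual-bonus scheme of this kind can be sketched as follows; the milestone positions and amounts are illustrative, not the ones evaluated in our experiments:

```python
def bonus_due(completed_hits, milestones):
    """Return the bonus owed when a worker reaches a milestone.

    completed_hits: number of HITs the worker has completed so far
    milestones: dict mapping a HIT-count milestone -> bonus in cents
    """
    return milestones.get(completed_hits, 0)

# Increasing bonuses at successive milestones to encourage workers to stay.
milestones = {5: 10, 15: 25, 30: 60}
total = sum(bonus_due(n, milestones) for n in range(1, 31))
print(total)  # 95: cents paid in bonuses to a worker who completes all 30 HITs
```

The increasing amounts make abandoning a batch just before a milestone costly for the worker, which is the retention lever studied in Chapter 6.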
HIT Scheduling: Crowd-powered systems can be multi-tenant, i.e., supporting
workloads generated by concurrent users. A traditional approach would publish a new
batch of tasks on the crowdsourcing platform for each incoming query. We ar-
gue that this is suboptimal for the overall efficiency of the system. Instead, we
propose to bundle heterogeneous tasks in a single batch. We take control of the
HIT serving schedule in order to seamlessly load-balance the available workers
on multiple heterogeneous HITs. This work is currently submitted for peer review
(see Chapter 7).
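As a sketch of the load-balancing idea (an assumed serving policy for illustration, not the scheduler evaluated in Chapter 7), a bundle can serve incoming workers round-robin across the batches it contains:

```python
from collections import deque

def serve_next_hit(bundle):
    """Pick the next HIT from a bundle of batches in round-robin order.

    bundle: deque of (batch_name, deque_of_hits); the front batch is
    served and rotated to the back so every batch gets a fair share
    of incoming workers.
    """
    for _ in range(len(bundle)):
        name, hits = bundle[0]
        bundle.rotate(-1)  # move the served batch to the back
        if hits:
            return name, hits.popleft()
    return None  # all batches in the bundle are exhausted

bundle = deque([("B1", deque(["h1", "h2"])), ("B2", deque(["h3"]))])
served = [serve_next_hit(bundle) for _ in range(3)]
print(served)  # [('B1', 'h1'), ('B2', 'h3'), ('B1', 'h2')]
```

From the worker's point of view the bundle looks like one large batch, while the requesters' individual batches progress in parallel.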
1.4.1 Additional Contributions
In addition to the core contributions of this thesis, which are listed above, we also published
the following pieces of work related to crowdsourcing.
1. We studied the data collected from AMT over the past five years, and analyzed a number
of key dimensions of the platform (see Chapter 3).
Difallah, Djellel Eddine, Michele Catasta, Gianluca Demartini, Ipeirotis, Panagiotis G.,
and Philippe Cudré-Mauroux. "The Dynamics of Micro-Task Crowdsourcing – The Case
of Amazon MTurk". Proceedings of the 24th international conference on World Wide Web.
ACM, 2015.
2. We presented a position paper where we first review the techniques currently used to
detect spammers and malicious workers (whether they are bots or humans) randomly
or semi-randomly completing tasks. Then, we describe the limitations of existing
techniques by proposing approaches that individuals, or groups of individuals, could
use to attack a task on a crowdsourcing platform.
Difallah, Djellel Eddine, Gianluca Demartini, and Philippe Cudré-Mauroux. "Mechani-
cal Cheat: Spamming Schemes and Adversarial Techniques on Crowdsourcing Platforms."
CrowdSearch. 2012.
3. We contributed to Hippocampus, a “Transactive Search” system that answers memory-
based queries by involving a group of people who have vivid memories of an event or
an interaction. In that work, we compare automated methods, the AMT crowd, and personal
social networks.
Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer,
and Philippe Cudre-Mauroux. "Hippocampus: answering memory queries using transac-
tive search." Proceedings of the companion publication of the 23rd international confer-
ence on World wide web companion. International World Wide Web Conferences Steering
Committee, 2014.
4. As an extension to Transactive Search, we describe the necessary components, the
architecture and the research directions for building a “transactive data management
system” that leverages social networks and crowdsourcing. TransactiveDB allows users
to pose different types of transactive queries in order to reconstruct collective human
memories.
Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer,
and Philippe Cudre-Mauroux. "TransactiveDB: Tapping into Collective Human Memo-
ries." Proceedings of the VLDB Endowment 7.14 (2014).
1.5 What this Thesis is Not About
Human computation is a multidisciplinary field, spanning from Human-Computer Inter-
action (HCI) to game theory, computer science, business, economics, social science, and
psychology. There might be a benefit in integrating multiple aspects from those research
areas to achieve better QoS in crowd-powered systems. In fact, other research agendas aim
at designing better user interfaces to obtain faster response times for a particular task [100],
add gamification elements to eliminate or reduce the cost [156], or leverage social networks
to increase the audience [54]. However, given that the field is still in its infancy, we choose to
focus on problems that aim at enhancing such systems by solely considering paid micro-task
crowdsourcing as a paradigm, and without necessarily integrating other techniques.
We do not build specific crowd-powered operators (see Section 2.2.2) but rather assume
the ones supported by the crowd-powered system. Often, in the literature, such operators
propose ad-hoc quality assurance methods. We believe that these methods must be the raison
d’être of a separate and extensible quality assurance module.
Finally, we do not aim at creating a new crowdsourcing platform, although we often felt that
AMT was missing some key features that could greatly benefit QoS, and in a few cases
we built custom solutions to showcase those benefits (see Chapter 5).
1.6 Outline
We organize the rest of this thesis as follows. Chapter 2 reviews relevant work on crowd-
powered systems and algorithms, in addition to existing methods tackling effectiveness and
efficiency in crowdsourcing. Next, in Chapter 3 we delve into an analysis of AMT; understanding
and characterizing our target crowdsourcing platform will eventually help us formulate some
of our design choices. In Chapter 4 we present and evaluate our quality assurance method. We
center our study around a data integration use-case, where we create a hybrid human-machine
system for entity linking and instance matching. Chapter 5 introduces push crowdsourcing,
a model that “routes” tasks to better fit workers. Our prototype is a Facebook application
that suggests tasks to its users based on their social profiles. In Chapter 6 we consider worker
retention as a new dimension for speeding up the execution of a batch of tasks and minimizing
its stagnation. We evaluate several bonus schemes that aim at retaining crowd workers longer
on a given batch. Chapter 7 introduces and evaluates scheduling algorithms that optimize
the execution of heterogeneous tasks while minimizing context switching for the workers. We
conclude with Chapter 8, which summarizes our main findings and future directions, as well
as our outlook for future developments in crowdsourcing.
2 Background in Crowd-Powered Systems
2.1 Introduction
Since its introduction in 2005 by Amazon Mechanical Turk, paid micro-task crowdsourcing has
been studied and applied for a range of purposes including entity resolution, entity linking,
schema matching, association rule mining, word sense disambiguation, relevance judgement,
and query answering. Such hybrid human-machine systems use crowdsourcing in order to
provide better solutions as compared to purely machine-based systems.
In the following, we give some background on algorithms and systems leveraging human
computation, and the techniques used for quality assurance, routing and scheduling of Human
Intelligence Tasks (HITs) which are the main themes covered in this thesis. Additional related
work is also covered in the corresponding chapters.
2.2 Crowd-Powered Systems
2.2.1 Crowd-Powered Database Systems
In the database community, hybrid human-machine data management systems were pro-
posed with CrowdDB [64], Qurk [117] and Deco [128]. While the architectural details of those
systems differ, their core concept suggests adding new modules to the system that interact
with the crowd via the target crowdsourcing platform API. The query language is extended
in order to support declarative operators that process selected records with the help of the
crowd. The query execution engine supports new query operators – usually in the form of
User Defined Functions (UDFs) – that encode a template comprised of a visual interface
descriptor1 to display the HIT, an operator signature that defines the input and the output
of each HIT, the task reward, optional HIT pre-requisites, in addition to any processing logic.
When such operators are invoked, their execution triggers the generation of HITs onto the
platform, through the API, along with the provided input.
1Any format that the used crowdsourcing platform API supports, e.g., HTML, XML, JSON.
[HIT mockup – Task: Gender Identification in CCTV Images. Instruction: “Please, look at the picture and select the correct matching gender of the individual appearing in it. If unsure, select 'Unknown'.” Options: Female / Male / Unknown. Reward: $0.05, Remaining: 30582. Buttons: Submit, Cancel.]
Figure 2.1 – A Human Intelligence Task mockup.
Consider the following use-case: The security cameras of a mall capture and store snapshots
into a database relation VISITORS. During a security check, the administrator needs to find
pictures of ’All male visitors from the last hour’. Since the gender information is not present
in any column of the relation, nor the snapshots are annotated, the only way to uncover this
information is by checking the image field for each record. Given the size of the table (several
thousands records) and a short time window, the administrator decides to run his query using
a crowd-operator getGender(Type Image) which was added to the system beforehand. This
operator takes the snapshot as input, and is expected to return the gender of the pictured
subject. A mockup of the HIT interface as presented to the crowd is shown in fig. 2.1, and the
query writen by the user in listing 2.1.
SELECT * FROM VISITORS v
WHERE getGender(v.picture) LIKE "male"
AND v.date >= DATE_SUB(NOW(), INTERVAL 1 HOUR);
Listing 2.1 – Sample operator of a crowd-powered DBMS.
In the realm of database systems, crowdsourcing can be used to fill in null values in tuples, or
to define subjective ‘ORDER BY’ operators that allow users to express queries such as ‘Sort
by scariest movie’. CrowdDB, in particular, goes beyond the usual closed-world assumption
of database systems, which states that what is not present in the database does not exist.
In fact, CrowdDB supports operators that can add new tuples to a relation, e.g., ‘Insert the
name, address and phone number of bakeries in Boston, MA’. This presents new challenges
to database systems in how to handle query optimization, especially since the cardinality
of such tables is not previously known [114]. Selke et al. [141] extend these ideas to cover
malleable database schemas, that is, the ability to add new columns by probing the crowd when
a relevant piece of information might be associated with a record. More recently, Catasta et al. proposed
TransactiveDB [37], a system that reconstructs non-transcribed information from collective
memories; here, social networks and personal acquaintances are leveraged to find pieces of
information.
Jeffery et al. [82] propose to hide the process of crowdsourcing from the user by defining
the concept of Labor Independence. Their goal is to simplify the declarative language used by
systems like CrowdDB, which forces users to be aware of the underlying crowdsourcing
process at the record level. Instead, their system Arnold takes generic parameters (expected
quality and total budget) to automatically crowdsource records within those criteria.
2.2.2 Crowd-Powered Database Operators
A natural development in crowd-powered database systems was the study of SQL-like oper-
ators tailored to the crowd. For instance, Parameswaran et al. [129] investigate the filtering
operation, which consists in applying a set of conditions (or predicates) to filter out
non-matching tuples. The main issue they tackle is how to reach a consensus out of multiple noisy
answers collected from the crowd, and run additional tasks if required. In their work, they
propose both an optimal and a heuristic strategy.
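A minimal sequential-voting filter in this spirit (a sketch, not the authors' optimal or heuristic strategies) stops collecting answers early once one side of the predicate has a clear lead; the margin and vote budget are illustrative parameters:

```python
def crowd_filter(ask_worker, margin=2, max_votes=5):
    """Decide whether a tuple passes a predicate using sequential voting.

    ask_worker() returns one worker's True/False judgment; we stop early
    once one side leads by `margin`, otherwise fall back to majority
    after `max_votes` answers.
    """
    yes = no = 0
    while yes + no < max_votes:
        if ask_worker():
            yes += 1
        else:
            no += 1
        if abs(yes - no) >= margin:
            break  # consensus reached early, no need for more HITs
    return yes > no

# A predicate the first two workers agree on stops after only two votes.
votes = iter([True, True, False, True, True])
print(crowd_filter(lambda: next(votes)))  # True, using 2 of the 5 budgeted votes
```

On easy tuples this spends two answers instead of five, which is the cost saving such adaptive strategies target.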
Marcus et al. [116] studied the Join and Sort operators, where they conclude that for a join
operation, a one-to-many join interface was optimal – as compared to a full pairwise cross-
join. On the other hand, for sort operations, they show that using a rating system instead
of a pairwise record comparison required far fewer HITs while producing similar results. In a
subsequent work [115], Marcus et al. created a Count operator which again leverages batching
as an efficient technique to dramatically lower the number of HITs to crowdsource. Here,
batching consists in showing multiple records to workers and asking them to provide a close
count estimate. Wang et al. [165] take advantage of transitive relation properties to further
reduce the set of elements to crowdsource in the case of Join operators.
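The transitivity idea can be sketched with a union-find structure that skips pairs whose match is already implied by earlier positive answers (a simplification covering only positive transitivity; `ask_crowd` is a hypothetical oracle standing in for a crowdsourced comparison):

```python
class UnionFind:
    """Track clusters of records known to refer to the same entity."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # path compression
        return self.parent[x]
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def crowdsource_join(pairs, ask_crowd):
    """Label candidate pairs, skipping those deducible by transitivity."""
    uf, asked = UnionFind(), 0
    for a, b in pairs:
        if uf.find(a) == uf.find(b):
            continue  # match already implied by earlier answers
        asked += 1
        if ask_crowd(a, b):  # crowd verdict: same entity?
            uf.union(a, b)
    return asked

# If a==b and b==c, the crowd never needs to see the pair (a, c).
pairs = [("a", "b"), ("b", "c"), ("a", "c")]
print(crowdsource_join(pairs, lambda x, y: True))  # 2 crowd questions instead of 3
```

The saving grows with cluster size: a cluster of n matching records needs only n-1 positive answers instead of all n(n-1)/2 pairs.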
Guo et al. [69] focus on the Max operator that finds the maximum element in a set of pairwise
comparisons. The problem is far from obvious when optimizing the pairwise comparison
operations; some workers might take longer to answer a HIT, or might provide an incorrect
answer; thus, the query is executed by arbitrary pairwise comparisons rather than a predefined
tournament-like order. The authors show that the problem at hand is NP-hard and provide
a heuristic to estimate the max, and a method to decide what pair to crowdsource next in
order to improve the results. Venetis et al. [154] develop a set of generic parameterized max
algorithms considering time, cost, and quality tradeoffs. Along the same line, top-k algorithms,
combining heuristics and crowdsourcing, have been proposed in [47, 109, 124, 131].
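As a toy illustration of crowdsourced max computation (a simple baseline of our own, not the algorithm of [69], which precisely avoids a fixed comparison order): a single-elimination tournament where each match is decided by a majority over repeated, possibly noisy, worker answers.

```python
def crowd_max(items, compare, votes=3):
    """Single-elimination max: each pairwise 'match' is crowdsourced
    `votes` times and decided by majority. `compare(a, b)` models one
    (possibly noisy) worker answering 'is a greater than b?'."""
    pool = list(items)
    while len(pool) > 1:
        nxt = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            wins_a = sum(compare(a, b) for _ in range(votes))
            nxt.append(a if wins_a * 2 > votes else b)
        if len(pool) % 2:            # odd item out advances directly
            nxt.append(pool[-1])
        pool = nxt
    return pool[0]
```

With perfect workers this needs n−1 comparisons times `votes`; with noisy workers, increasing `votes` trades cost for accuracy, which is exactly the tradeoff the cited algorithms optimize.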
As introduced in the previous section, CrowdDB allows some operators to add new rows to a
relation. A sample application is item enumeration, as in the popular query “List all
possible ice cream flavors”. Trushkowsky et al. [152] used a statistical approach, inspired by
species estimation algorithms, to reason about the progress of an enumeration query and
estimate the size of the set.
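The core statistical idea can be illustrated with the classic Chao1 species-richness estimator (a simplified stand-in; Trushkowsky et al. use a refined estimator that also accounts for crowd-specific sampling behavior):

```python
from collections import Counter

def chao1(answers):
    """Estimate the total number of distinct items in the underlying set
    from a multiset of crowd answers, via the Chao1 species estimator:
    S_est = S_obs + f1^2 / (2 * f2), where f1 and f2 are the numbers of
    items seen exactly once and exactly twice (bias-corrected form when
    f2 == 0)."""
    freq = Counter(answers)
    s_obs = len(freq)
    f1 = sum(1 for c in freq.values() if c == 1)
    f2 = sum(1 for c in freq.values() if c == 2)
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    return s_obs + f1 * (f1 - 1) / 2
```

Intuitively, many singleton answers suggest that many items remain unseen, so the estimate exceeds the observed count; when every item has been reported repeatedly, the estimate converges to the observed set size.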
The previous operators are mainly static, fulfilling their objective on a fixed set of elements.
Mahmood et al. [111] propose to use a crowdsourced index for operations that would require
searching, updating and deleting records. Their Palm-Tree index is constructed with the help of
the crowd and is based on a B+Tree.
2.2.3 Crowd-Powered Systems in Other Communities
In the Information Retrieval community, crowdsourcing is especially appealing for creating
relevance judgments to evaluate the results of a search system. This operation is traditionally
carried out by expert judges, who cannot, however, handle the requests of all researchers.
As such, crowdsourcing has been used to produce relevance judgments for documents [12],
books [85, 86], or entities [28].
Another example is the use of crowdsourcing to answer tail queries in Web Search Engines [25].
Tail queries are keywords that rarely appear in search engine logs, as opposed to more popular
terms. Here the goal is to ask the crowd to select the most appropriate link to a tail query
within a set of machine-selected candidate Web pages. More recently, Demartini et al. [51]
proposed CrowdQ, a system that helps answer search queries by leveraging the cognitive
abilities of crowd workers. Although the system does not operate in real time, the crowd helps
create generic templates that can be applied to future queries.
In the Semantic Web community, crowdsourcing has also recently been considered, for
instance, to link [50] or map [139] entities. In both cases, the use of crowdsourcing can
significantly improve the quality of the generated links or mappings as compared to purely
automatic approaches. In the context of Natural Language Processing, games to crowdsource
the Word Sense Disambiguation task [140] have recently been proposed.
Amsterdamer et al. [13] introduce the concept of Crowd Mining, that is, retrieving interesting
facts and rules directly from the crowd. Specifically, they study the case of association rule
mining without a pre-existing database of transactions. Their system, CrowdMiner, asks the
crowd to directly provide such rules from their own experience, an interesting case that
leverages humans’ ability to summarise information and to infer facts.
In the case of large enterprises, knowledge is often distributed across a number of employees.
Crowdsourcing within an enterprise (i.e., when the crowd is composed of the company
employees) is becoming popular and can benefit from the fact that employees are domain experts
who can solve tasks better and faster than anonymous crowds. In this case, crowdsourcing can
be used, for example, to efficiently find solutions to operational issues [160]. Crowdsourcing
has also been used in the biomedical domain where, for example, ontological relations among
diseases can be validated by the crowd [122, 132].
2.2.4 Languages and Toolkits
Apart from the regular AMT API, a whole ecosystem of tools and paradigms similar to
programming languages tailored to crowdsourcing has started appearing [121]. Such methods expose
easier abstractions over the crowd, allowing system designers to transparently crowdsource
some of their processing needs.
For example, requesters might have complex jobs that can be decomposed into pipelines of
micro-tasks (or HIT workflows). CrowdForge [91] provides a framework for task decomposition
and merging using a map-reduce style of programming. The crowd is not only asked to complete
tasks but also to partition larger ones and merge results. CrowdWeaver [89] is a visual interface
that simplifies the management of such workflows and allows progress monitoring of the
whole system.
TurKit [108] automates the execution of iterative tasks without the manual intervention of an
operator. As an example, consider the case of spell checking a short paragraph. We can quickly
create a single HIT containing the original text. After the first iteration, TurKit takes the output
produced by the previous worker and creates a new HIT engaging a different worker. This
process then continues until we reach a predefined number of iterations.
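The iteration pattern can be sketched as a simple loop, with a hypothetical `post_hit` function standing in for TurKit's HIT-publishing primitives:

```python
def iterative_improvement(text, post_hit, iterations=3):
    """TurKit-style iteration: each round, a new worker receives the
    previous worker's output and returns an improved version.
    `post_hit(current_text)` stands in for publishing one HIT and
    blocking until a worker submits an answer."""
    current = text
    for _ in range(iterations):
        current = post_hit(current)   # a different worker each round
    return current
```

In TurKit proper, the loop survives crashes because each `post_hit` result is memoized (the "crash-and-rerun" model); the sketch above omits that persistence layer.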
AMT is increasingly used by non-programmers, especially in research areas such as
behavioural science and psychology. PsiTurk [8] is an automation tool that lowers the entry
barrier to AMT by providing an open platform for exchanging reusable code and experiment
designs.
2.3 Task Quality
Quality control is a common issue in paid crowdsourcing, for several reasons including:
• Human intrinsic factors (e.g., fatigue, boredom, priming, bias, hastiness), which can
affect some of the answers given by the workers.
• The results are hard to verify, and the requesters cannot check the answers one by
one, as this would defeat the whole purpose of crowdsourcing the task in the first place.
• Some unfaithful workers intend to game the system in order to collect the monetary
reward without properly completing the task [55].
2.3.1 Task Repetitions
In order to avoid bad quality answers, the same HIT is usually offered multiple times to
distinct workers; once all the tasks are completed, the requester decides which answers to
pick and how to aggregate them. The primary goal of task repetition is to diversify the output
by asking different workers and thereby potentially improve the quality of a single HIT – the
errors of the involved workers are usually assumed to be independent. Task repetition comes
at the price of multiplying the cost by the number of required repetitions. In some cases it is
possible to automatically decide whether a new repetition of the HIT is needed, and thus
repetitions can be run on demand (we refer the reader to the related work on crowdsourcing
labels for supervised machine learning [23, 78, 144, 148]).
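A minimal sketch of such on-demand repetition (a simple sequential stopping rule of our own, not any of the cited algorithms): keep requesting answers until the leading answer is ahead of the runner-up by a fixed margin, or until a repetition budget is exhausted.

```python
from collections import Counter

def repeat_until_confident(get_answer, lead=2, budget=7):
    """Request answers one at a time; stop once the top answer leads the
    runner-up by `lead` votes, or the repetition budget runs out.
    `get_answer()` stands in for publishing one more repetition of the
    HIT. Returns (winning_answer, votes_collected)."""
    votes = Counter()
    for _ in range(budget):
        votes[get_answer()] += 1
        top = votes.most_common(2)
        if len(top) == 1 and top[0][1] >= lead:
            break
        if len(top) == 2 and top[0][1] - top[1][1] >= lead:
            break
    return votes.most_common(1)[0][0], sum(votes.values())
```

When the workers agree, this stops after only `lead` repetitions; the budget is spent in full only on contentious HITs.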
2.3.2 Test Questions
A simple screening process can also be used to quickly identify malicious workers [60]. The
requester can use k HITs2 as test tasks that the worker has to pass before, or during, a work
session. We can categorize test questions as follows:
• Gold Standard Tasks: The requester can create and add a set of indistinguishable yet
verifiable HITs to their larger batch. Those HITs can be inserted randomly during the
work session.
• Qualification Tasks: AMT provides the ability to request specific qualifications from the
workers. Those qualifications can either be drawn from the gold standards or consist
of more generic tasks, e.g., verifying that a worker is fluent in French.
• Turing Test Tasks: Such questions (e.g., Captchas) are widely used to stop bots and can
also be generated indefinitely. Here, the requester does not have to worry about creating
a test set of questions.
Test questions are powerful tools to detect malicious workers, especially when the test cannot
be differentiated from regular tasks. However, they come at a cost: for large batches of HITs, a
bigger gold standard collection is needed to prevent workers from spotting recurrent questions.
Moreover, test questions should be selected carefully so that they do not trick genuine workers,
yet are not easy for robots to answer.
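The gold-standard mechanism can be sketched as follows (hypothetical function names; the acceptance threshold is illustrative): gold HITs are shuffled into the work session, and a worker's session is accepted only if their accuracy on the hidden gold questions is high enough.

```python
import random

def mix_in_gold(hits, gold_hits, rng=None):
    """Shuffle verifiable gold-standard HITs into a batch of regular HITs
    so that workers cannot tell them apart by position."""
    rng = rng or random.Random()
    session = list(hits) + list(gold_hits)
    rng.shuffle(session)
    return session

def screen_worker(answers, gold_keys, threshold=0.75):
    """Accept a worker's session only if their accuracy on the hidden
    gold questions meets the threshold. `answers` maps HIT ids to the
    worker's answers; `gold_keys` maps gold HIT ids to the truth."""
    correct = sum(answers.get(h) == truth for h, truth in gold_keys.items())
    return correct / len(gold_keys) >= threshold
```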
2.3.3 Result Aggregation
The aggregation of the final results is a well-studied topic. The most straightforward approach
is to proceed with a majority vote: a simple, yet rather effective, approach [104]. The authors
of [73] formalized the majority vote approach and proposed the use of a control group that
double-checks the answers of a prior run.
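A minimal majority-vote aggregator, which also reports the fraction of workers agreeing with the winning answer as a crude confidence signal (an illustration, not the formalization of [73]):

```python
from collections import Counter

def majority_vote(answers_per_hit):
    """Aggregate the repeated answers collected for each HIT by simple
    majority. `answers_per_hit` maps a HIT id to the list of answers
    collected for it; returns {hit: (winner, agreement_fraction)}."""
    result = {}
    for hit, answers in answers_per_hit.items():
        winner, count = Counter(answers).most_common(1)[0]
        result[hit] = (winner, count / len(answers))
    return result
```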
The distribution of workers and the number of tasks they perform is usually characterized
by a power law [12]: many workers do few tasks, and few workers do many tasks. With simple
aggregation schemes such as majority voting, the decision relies solely on the judgments
collected for the task itself. However, one can capture many more signals that can
help make a better aggregation decision. Using machine learning algorithms and statistics,
one can model task complexity as well as workers’ error rates, skills and maliciousness [49, 50, 59, 79,
135, 162, 166, 168]. Sheshadri et al. surveyed and assessed many of these methods in their
SQUARE benchmark [145].
While requesters can directly benefit from those methods without any additional cost, their
complexity, and their sometimes meager improvements, often make majority voting the preferred
approach.
2k is usually significantly smaller than the total number of HITs.
2.3.4 Task Design
The HIT design is usually the responsibility of the requester. It encompasses the visual
representation of the tasks, the clarity of the instructions, the communication style, and even
bonus mechanisms, all of which can be critical to boost quality, speed, or both, for any
crowdsourcing campaign. A study on the impact of incentives was recently conducted in [142]. The
authors observed that crowdsourcing platforms favor monetary incentives over social
ones, and hypothesized that explicit worker conditioning (e.g., informing workers that they
will be sanctioned if they disagree with other workers on the same task), in addition to
quality control, can lead to better result quality.
Kittur et al. [88] stressed the importance of task formulation and presented their results with
two variants of a given task formulated differently. On a different note, Eickhoff et al. [61]
observed that malicious workers are less attracted to “Novel tasks that involve creativity and
abstract thinking”.
2.4 Task Routing
Another mechanism used for quality assurance is routing tasks to workers who might possess
the knowledge or background needed to provide a higher quality answer. Donmez et al. [58]
use an exploration–exploitation technique: they first estimate the accuracy of the
crowd workers and, early in the process, drop those workers who fall below a threshold while
optimizing their task-to-worker assignment.
While the task assignment process can be controlled to some extent within a batch on AMT, a
more generic approach is to use the concept of push crowdsourcing that we introduce in [56],
where the crowdsourcing platform itself maintains worker profiles and can actively assign
a HIT to workers when it sees fit. Tasks can even be pushed offline, i.e., assigned to a
worker who is not currently on the platform. A particularly appealing application of pushing
tasks is mobile-based crowdsourcing [94].
In its early version, MobileWorks [6] described an architecture for adding tasks to a queue
and then “routing” them to one or several adequate workers [97]. However, their technique
for identifying worker expertise is not described.
Karger et al. [84] show that an adaptive task assignment system, one that dynamically decides
to whom to assign the next task, is not optimal if the crowd workers are ephemeral. Among
their conclusions is the following: “building a reliable worker-reputation system is essential to
fully harnessing the potential of adaptive designs.”
Ipeirotis et al. [77] proposed a novel crowdsourcing system based on Google Ads to target a
crowd population interested in niche domains. Crowd participation is voluntary,
but the system still incurs the advertising cost. The latter can be alleviated, or suppressed, in
the case of non-profit use.
2.5 Task Scheduling
In this thesis, we consider a system-oriented definition of task scheduling, where a set of
heterogeneous HITs (processes), possibly submitted by multiple requesters (tenants), are given
different priorities to get executed by the available crowd on the platform (resources). The
goal is to achieve load balancing and to minimize the overall latency that requesters might
otherwise experience. To our knowledge, there is little work in this area thus far. Nonetheless,
in the following, we refer to related work where the term “scheduling” was used, even when it
does not strictly match our definition.
So far, scheduling HITs for the crowd has been addressed in the context of work quality, and
often mixed with task routing (see Section 2.4). CrowdControl [134], for example, proposes
scheduling approaches that take into account the history of the workers to understand how
best to assign HITs to workers based on how they learn while doing tasks. Similarly, SmartCrowd
[137] considers task assignment as an optimization problem based on worker skills and their
reward requirements. Another work looking at the quality dimension is [125], where the authors
schedule tasks according to the required skills and previous feedback from requesters.
Khazankin et al. [87] proposed an architecture for a crowdsourcing platform that can provide
Service Level Agreements to its requesters, with a particular focus on work quality. The authors
show, by means of simulations, how approaches that take worker skills into account outperform
standard scheduling methods.
A different type of scheduling has been addressed in [53], where the authors look at crowdsourcing
tasks that need to take place at specific real-world geographical locations. In this case, it is
necessary to schedule tasks for workers so as to minimize spatial movements, by taking into
account their geographical locations and paths.
Task allocation in teams has been studied in [14], where the authors defined the problem, examined
its complexity, and proposed greedy methods to allocate tasks to teams and adjust team sizes.
Team formation given a task has been studied in [15], looking at worker skills.
2.6 Conclusions
Human Computation (HC) has been intensively studied over the past few years. One of the
major trends in that context is the study and characterization of HC processes from a com-
puter science perspective. Davis recently proposed the concept of Human co-Processing Units
(HPUs) [48] to model HC components along with CPUs or GPUs on computational platforms.
Many other researchers, on the other hand, believe that humans behave fundamentally
differently from machines and that radically new abstractions are required in order to characterize
(and potentially predict) the behavior of the crowd. While this debate is mostly conceptual and
ethical, our contributions are mainly technical: we design new algorithms to manage the
crowd more efficiently and effectively, and experimentally compare the effects of various HIT
scheduling, pricing, and routing techniques on crowd-powered systems. We hope that
the results gathered in this context will be instrumental in better understanding and managing
the crowd.
In the next chapter we start by analyzing AMT, the main target crowdsourcing platform used in
our evaluations, to develop a better understanding of the different characteristics that drive
the performance of the crowdsourcing campaigns running on it.
3 An Analysis of the Amazon Mechanical Turk Crowdsourcing Marketplace
3.1 Introduction
The efficiency and effectiveness of a crowd-powered system heavily depend on the target
crowdsourcing platform, partly because of differences in crowd demographics, crowd size,
available work, and competing requesters. Such factors can have a significant influence on
the quality of the results and the speed of a crowdsourcing campaign. In this thesis, we mainly
used AMT as the crowdsourcing platform for our empirical evaluations. In this section, we start
by analyzing this platform using a five-year log containing information about the posted HITs
and their progress status, obtained from mturk-tracker.com [76]. We report key findings on
some of the factors that shape the dynamics of this platform; these findings will eventually
help us explain or design some of the methods proposed in this thesis. Moreover, using
features derived from a large-scale analysis of these logs, we propose a method to predict
the throughput of a batch of HITs published by a given requester at a given point in time.
This prediction is based on several features, including the current platform load and task
types. Using this prediction method, we try to understand the impact of each feature that we
consider, as well as its scope over time.
The main findings of our analysis are: 1) the type of tasks published on the platform has
changed over time, with content creation HITs being the most popular today; 2) the HIT
pricing approach has evolved towards larger and higher-paid HITs; 3) geographical restrictions
are applied to certain task types (e.g., surveys for US workers); 4) we observe a consistent growth
in the number of new requesters who use the platform; 5) we identify the size of the batch as the
main feature impacting the progress of a given batch; 6) we observe that supply (workforce)
has little control over driving the price of demand (posted HITs).
In summary, the main contributions of this analysis are:
• An analysis of the evolution of a popular micro-task crowdsourcing platform looking at
dimensions like topics, reward, worker location, task types, and platform throughput.
• A large-scale classification of 2.5M HIT types published on AMT.
• A predictive analysis of HIT batch progress using more than 29 different features.
• An analysis of the crowdsourcing platform as a market (demand and supply).
The rest of the chapter is structured as follows. In Section 3.2, we overview recent work on
micro-task crowdsourcing specifically focusing on how micro-task crowdsourcing has been
used and on how it can be improved. Section 3.3 presents how AMT has evolved over time in
terms of topics, reward, and requesters. Section 3.4 summarizes the results of a large-scale
analysis on the types of HIT that have been requested and completed over time. Based on
the previous findings, Section 3.5 presents our approach to predicting the throughput of the
crowdsourcing platform for a batch of published HITs. Section 3.6 studies the AMT market and
how different events correlate (e.g., new HITs attracting more workers to the platform). We
discuss our main findings in Section 3.7 before concluding in Section 3.8.
3.2 Related Work
AMT Market Analysis and Prediction
An initial analysis of the AMT market was done in [76]; here, we extend this work by considering
the time dimension and analyzing long-term trend changes. Faradani et al. [62] proposed
a model to predict the completion time of a batch. Our prediction endeavor is however different,
in the sense that we try to predict the immediate throughput based on current market
conditions and to understand which features have more impact than others.
The Future of Crowdsourcing Platforms
In [90], the authors provide their own view on how the crowdsourcing market should evolve in the
future, specifically focusing on how to support full-time crowd workers. Similarly to them, our
goal is to identify ways of improving crowdsourcing marketplaces by understanding the
dynamics of such platforms—based on historical data and models. Our work is complementary
to existing work, as we present a data-driven study of the evolution of micro-task crowdsourcing
over five years. Our findings can serve as supporting evidence for the ongoing efforts to
improve crowdsourcing quality and efficiency described above. Our work can also
be used to support requesters in publishing HITs on these platforms and getting results more
efficiently.
Online Reputation
Many AMT workers share their experience about HITs and requesters through dedicated web
forums and ad-hoc websites [80]. Requester “reviews” serve as a way to measure the reputation
of requesters among workers, and are assumed to influence the latency of the published
tasks [146], as workers are naturally more attracted to HITs published by requesters with
a good reputation.
3.3 The Evolution of Amazon MTurk From 2009 to 2014
In this section, we start by describing our dataset and extracting some key information
and statistics that we will use in the rest of the chapter.
3.3.1 Crowdsourcing Platform Dataset
Over the past five years, we have periodically collected data about HITs published on AMT. The
data that we collect from the platform is available at http://mturk-tracker.com/.
In this work, we consider hourly aggregated data that includes the available HIT batches and
their metadata (title, description, rewards, required qualifications, etc.), in addition to their
progress over time, that is, the temporal variation of the set of HITs available. In fact, one of
the main metrics that we leverage (see Section 3.5) is the throughput of a batch, i.e., how many
HITs get completed between two successive observations. In Figure 3.1, we plot the number
of HITs available in a given batch versus its throughput. An interesting observation that can
be made is that large batches can achieve high throughput (thousands of HITs per minute).
In total, our dataset covers more than 2.5M different batches with over 130M HITs. We note
that the tracker reports data only periodically and does not reflect fine-grained information
(e.g., real-time variations). We believe, however, that it captures enough information to perform
meaningful, long-term trend analyses and to understand the dynamics and interactions within
the crowdsourcing platform.
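Under these definitions, the throughput of a batch can be computed from two successive tracker observations of its number of available HITs (a sketch; clamping negative values to zero is a simplification, since requesters may also add HITs between observations):

```python
def batch_throughput(prev_available, curr_available, interval_minutes=60.0):
    """Throughput of a batch in HITs per minute between two successive
    observations of its number of remaining (available) HITs. A growth
    in availability (new HITs posted) would show up as negative
    completion and is clamped to zero in this simple sketch."""
    completed = prev_available - curr_available
    return max(completed, 0) / interval_minutes
```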
3.3.2 A Data-driven Analysis of Platform Evolution
First, we identify trends obtained from information aggregated over time, keywords, and
countries associated with the published HITs. Each of the following analyses is also available as
an interactive visualization over the historical data at http://xi-lab.github.io/mturk-mrkt/.
Topics Over Time First, we want to understand how different topics have been addressed
by means of micro-task crowdsourcing over time. In order to run this analysis, we look at the
keywords associated with published HITs. We observe the evolution of keyword popularity
and the associated reward on AMT. Figure 3.2 shows this behavior. Each point in the plot represents
a keyword associated with the HITs, with its frequency (i.e., the number of HITs with this keyword)
on the x-axis, and the average reward in a given year on the y-axis. The path connecting the data
points indicates the time evolution, starting in 2009, with each point representing the keyword
usage over one year.
We observe that the frequencies of the ‘audio’ and ‘transcription’ keywords (i.e., the blue and red
paths from left to right) have substantially increased over time. They have become the most
Figure 3.1 – Batch throughput versus number of HITs available in the batch. The red line corresponds to the maximum throughput we could have observed due to the tracker periodicity constraints. For readability, this graph represents a subset of 3 months (January–March 2014), and HITs with rewards of $0.05 and less.
popular keywords in the last two years and are paid more than $1 on average. HITs with the
‘video’ tag have also increased in number with a reward that has reached a peak in 2012 and
decreased after that. HITs tagged as ‘categorization’ have been paid consistently in the range
of $0.10-$0.30 on average, except in 2009 where they were rewarded less than $0.10 each.
HITs tagged as ‘tweet’ have not increased in number but have been paid more over the years,
reaching $0.90 on average in 2014: This can be explained by more complex tasks being offered
to workers, such as sentiment classification or writing of tweets.
Preferred Countries by Requesters Over Time Figure 3.3 shows the requirements set by
requesters with respect to the countries they wish to select workers from. The left part of
Figure 3.3 shows that most HITs are to be completed exclusively by workers located in the
US, India, or Canada. The right part of Figure 3.3 shows the evolution over time of the
country requirement phenomenon. The plot shows the number of HITs with a certain country
requirement (on the y-axis) and its time evolution (on the x-axis) in yearly steps. The size of
the data points indicates the total reward associated with those HITs.
We observe that US-only HITs dominate, both in terms of their large number and in terms
of the reward associated with them. Interestingly, we notice that HITs for workers based in
India have been decreasing over time. On the other hand, HITs for workers based in Canada
have been increasing over time, becoming in 2014 more numerous than those exclusively available to
Figure 3.2 – The use of keywords to annotate HITs. Frequency corresponds to how many times a keyword was used, and Average Reward corresponds to the average monetary reward of batches that listed the keyword. The size of the bubbles indicates the average batch size.
workers based in India. We also see that the reward associated with them is smaller than the
budget for India-only HITs. As of 2014, HITs for workers based in Canada or the UK are both more
numerous than those for workers based in India. Overall, 88.5% of the HIT batches that were
posted in the considered time period did not require any specific worker location; 86% of
those which did imposed a constraint requesting US-based workers.
Figure 3.4 shows the top keywords attached to HITs restricted to specific locations. We observe
that the most popular keywords (i.e., ‘audio’ and ‘transcription’) do not require country-specific
workers. We also note that US-only HITs are most commonly tagged with ‘survey’.
HIT Reward Analysis Figure 3.5 shows the most frequent rewards assigned to HITs over
time.1 We observe that while in 2011 the most popular reward was $0.01, HITs paying
$0.05 have recently become more frequent. This can be explained both by how workers search for HITs on
AMT and by the AMT fee scheme. Requesters now prefer to publish more complex HITs, possibly
with multiple questions in them, and grant a higher reward: this also attracts those workers
who are not willing to complete a HIT for small rewards, and reduces the fees paid to AMT,
which are computed based on the number of HITs published on the platform.
Requester Analysis In order to be sustainable, a crowdsourcing platform needs to retain
requesters over time or get new requesters to replace those who do not publish HITs anymore.
1Data for 2014 has been omitted as it was not comparable with other year values.
Figure 3.3 – HITs with specific country requirements. On the left-hand side, the countries with the most HITs dedicated to them. On the right-hand side, the time evolution (x-axis) of country-specific HITs with volume (y-axis) and reward (size of data point) information.
Figure 3.4 – Keywords for HITs restricted to specific countries.
Figure 3.6 shows the number of new requesters who used AMT and the overall number of
active requesters at a certain point in time. We can observe an increasing number of active
requesters over time and a constant number of new requesters who join the platform (at a rate
of 1,000/month over the last two years).
It is also interesting to look at the overall amount of reward for HITs published on the platform,
as platform revenues are computed as a function of HIT reward. From the bottom part of Figure
3.6, we observe a linear increase in the total reward for HITs on the platform. Interestingly, we
also observe some seasonality effects over the years, with October being the month with the
highest total reward and January or February being the month with minimum total reward.
HIT Batch Size Analysis When a lot of data needs to be crowdsourced (e.g., when many
images need to be tagged), multiple tasks containing similar HITs can be published together.
We define a batch of HITs as a set of similar HITs published by a requester at a certain point in
time.
Figure 3.7 shows the distribution of batch sizes in the period from 2009 to 2014. We can
Figure 3.5 – Popularity of HIT reward values over time.
Figure 3.6 – Requester activity and total reward on the platform over time.
observe that most of the batches were of size 1 (more than 1M), followed by a long tail of larger,
but less frequent, batch sizes.
Figure 3.8 shows how batch size has changed over time. We observe that the average batch
size has slightly decreased. The monthly median is 1 (due to the heavily skewed distribution).
Another observation is that in 2014, very large batches containing more than
200,000 HITs appeared on AMT.
3.4 Large-Scale HIT Type Analysis
In this section, we present the results of a large-scale analysis of the evolution of the HIT types
published on the AMT platform. For this analysis, we used the definition of HIT types proposed
in [65], in which the authors perform an extensive study involving 1,000 crowd workers to
understand their working behavior, and categorize the types of tasks that the crowd performs into six
Figure 3.7 – The distribution of batch sizes.
top-level “goal-oriented” tasks, each containing further sub-classes. We briefly describe the
six top-level classes introduced by [65] below.
• Information Finding (IF): Searching the Web to answer a certain information need. For
example, “Find the cheapest hotel with ocean view in Monterey Bay, CA”.
• Verification and Validation (VV): Verifying certain information or confirming the validity
of a piece of information. Examples include checking Twitter accounts for spamming
behaviors.
• Interpretation and Analysis (IA): Interpreting Web content. For example, “Categorize
product pictures in a predefined set of categories”, or “Classify the sentiment of a tweet”.
• Content Creation (CC): Generating new content. Examples include summarizing a
document or transcribing an audio recording.
• Surveys (SU): Answering a set of questions related to a certain topic (e.g., demographics
or customer satisfaction).
• Content Access (CA): Accessing some Web content. Examples include watching online
videos or clicking on provided links.
3.4.1 Supervised HIT Type Classification
Using the various definitions of HIT types given above, we trained a supervised machine
learning model to classify HIT types based on their metadata. The features we used to train
the Support Vector Machine (SVM) model are: HIT title, description, keywords, reward, date,
allocated time, and batch size.
To train and evaluate the supervised model, we created labelled data: We uniformly sampled
5,000 HITs over the entire five-year dataset and manually labelled their type by means of
crowdsourcing. In detail, we asked workers on MTurk to assign each HIT to one of the
Figure 3.8 – Average and maximum batch size per month (2009-2014). The monthly median is 1.
predefined classes by presenting them with the title, description, keywords, reward, date,
allocated time, and batch size for the HIT. The instructions also contained the definition and
examples for each task type. Workers could label tasks as ‘Others’ when unsure or when the
HIT did not fit in any of the available options.
After assigning each HIT for labelling to three different workers in the crowd, a consensus on the
task type label was reached in 89% of the cases (leaving 551 cases with no clear majority). A
consensus was reached when at least two out of the three workers agreed on the same HIT type
label. The remaining cases, that is, those where the workers provided different labels or were
not sure about the HIT type, were then removed from our labelled dataset.
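The 2-out-of-3 consensus rule can be sketched as follows (the label values are invented for illustration):

```python
from collections import Counter

def consensus(labels, min_agree=2):
    """Return the majority label if at least `min_agree` workers agree, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None

consensus(["CC", "CC", "SU"])   # two of the three workers agree on CC
consensus(["CC", "SU", "IA"])   # no clear majority: the HIT is discarded
```

HITs for which this function returns None correspond to the 551 cases removed from the labelled dataset.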
Using the labelled data, we trained a multi-class SVM classifier for the 6 different task types
and evaluated its quality with 10-fold cross validation over the labelled dataset. Overall, the
trained classifier obtained a Precision of 0.895, a Recall of 0.899, and an F-Measure of 0.895.
Most of the classifier errors (i.e., 66 cases) were caused by incorrectly classifying IA instances
as CC jobs.
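To make the setup concrete, here is a minimal sketch of such a classifier using scikit-learn. The toy labelled HITs are invented for illustration, and only the textual metadata is used; the actual model also includes reward, date, allocated time, and batch size as features.

```python
# Hedged sketch of the HIT-type classifier: a linear SVM over TF-IDF features
# of the textual metadata, evaluated with 10-fold cross-validation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

hits = [
    ("transcribe audio recording of an interview", "CC"),
    ("transcribe this short audio clip", "CC"),
    ("answer a short demographics survey", "SU"),
    ("opinion survey about shopping habits", "SU"),
    ("find the cheapest hotel in monterey bay", "IF"),
    ("search the web for a company address", "IF"),
] * 5  # repeated so 10-fold cross-validation has enough samples per class

texts, labels = zip(*hits)
model = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(model, list(texts), list(labels), cv=10)

model.fit(list(texts), list(labels))
pred = model.predict(["transcribe a podcast audio file"])[0]
```

On real data, one would concatenate title, description, and keywords into the text field and append the numeric features; the cross-validation scores on this toy sample are trivially high because of the duplicated examples.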
Performing feature selection for the HIT type classification problem, we observed that the
best features based on information gain are the HIT allotted time and reward: This indicates
that HITs of different types are associated with different levels of reward as well as different
task durations (i.e., longer, better-paid tasks versus shorter, worse-paid ones). The most
distinctive keywords for identifying HIT types are ‘transcribe’, ‘audio’, and ‘survey’, which
clearly identify CC and SU HITs.
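The information-gain criterion behind this feature ranking can be illustrated with a small self-contained computation. The labels and discretized feature values below are invented; on this toy data the keyword feature separates the classes perfectly and therefore gains more than the coarser time feature.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy reduction in `labels` obtained by splitting on `feature_values`."""
    n = len(labels)
    by_value = {}
    for v, y in zip(feature_values, labels):
        by_value.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

# Toy HIT types with a discretized allotted-time feature and a keyword feature:
labels = ["SU", "SU", "SU", "CC", "CC", "CC", "IA", "IA"]
time_f = ["long", "long", "long", "long", "long", "long", "short", "short"]
kw_f   = ["survey", "survey", "survey", "audio", "audio", "audio", "other", "other"]

gain_keyword = information_gain(kw_f, labels)
gain_time = information_gain(time_f, labels)
```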
Using the classifier trained over the entire labelled dataset, we then performed a large-scale
classification of the types for all 2.5M HITs in our collection. This allows us to study the
evolution of the task types over time on the AMT platform, which we describe next.
Figure 3.9 – Popularity of HIT types over time (count per task category, log scale, per year 2009-2014).
3.4.2 Task Type Popularity Over Time
Using the results of the large-scale classification of HIT types, we analyze which types of HITs
have been published over time. Figure 3.9 shows the evolution of task types published on
AMT. We can observe that, in general, the most popular type of task is Content Creation. In
terms of observable trends, we note that, while there is a general increase in the volume of
tasks on the platform, CA tasks have been decreasing over time. This can be explained by the
enforcement of AMT terms of service, which state that workers should not be asked to create
accounts on external websites or be identified by the requester. In the last three years, SU and
IA tasks have seen the biggest increase.
3.5 Analyzing the Features Affecting Batch Throughput
Next, we turn our attention to analyzing the factors that influence the progress (or the pace) of
a batch, how those factors influence each other and how their importance changes over time.
In order to conduct this analysis, we carry out a prediction experiment on a batch's throughput,
that is, the number of HITs that will be completed for a given batch within the next one-hour
time frame (i.e., the DIFF_HIT feature is the target class). Specifically, we model this
task as a regression problem using 29 features; some of them were used in the previous section
to classify the HIT type, and we describe the remaining ones in Section 3.5.1.
3.5.1 Machine Learning Features
The following is the list of features associated with each batch. We used these features in our
machine learning approach to predict batch throughput for the next hourly observation:
• HIT_available: Number of available HITs in the batch.
• start_time: The time of an observation.
• reward: HIT Reward in USD.
• description: String length of the batch’s description.
• title: String length of the batch’s title.
• keywords: Keywords (space separated).
• requester_id: ID of the requester.
• time_alloted: Time allotted per task.
• tasktype: Task class (as per our classification in Section 3.4).
• ageminutes: Age since the batch was posted (minutes).
• leftminutes: Time left before expiration (minutes).
• location: The requested worker's location (e.g., US).
• totalapproved: Batch requirement on the number of total approved HITs.
• approvalrate: Batch requirement on the percentage of workers approval.
• master: Worker is a master.
• hitGroupsAvailableUI: Number of batches as reported on the MTurk dashboard.
• hitsAvailableUI: Number of HITs available as reported on the MTurk dashboard.
• hitsArrived: Number of new HITs arrived.
• hitsCompleted: Number of HITs completed.
• rewardsArrived: Sum of rewards associated with the HITs arrived.
• rewardsCompleted: Sum of rewards associated with the HITs completed.
• percHitsCompleted: Ratio of HITs completed and total HITs available.
• percHitsPosted: Ratio of new HITs arrived and total HITs available.
• diffHits: hitsCompleted-hitsArrived.
• diffHitsUI: Difference in HITs observed from the MTurk dashboard.
• diffGroups: Computed difference in the number of completed and arrived batches.
• diffGroupsUI: Difference in the number of completed and arrived batches observed from
the MTurk dashboard.
• diffRewards: Difference in rewards = (rewardsArrived-rewardsCompleted).
• DIFF_HIT: Number of HITs completed since the last observation.
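To make the derived features concrete, the following minimal sketch computes a few of them from two consecutive hourly snapshots of a batch (the field names and numbers are invented for illustration):

```python
def derive_features(prev, curr):
    """`prev` and `curr` are hourly snapshots: {"hits_available": int, "reward": float}."""
    arrived = max(curr["hits_available"] - prev["hits_available"], 0)
    completed = max(prev["hits_available"] - curr["hits_available"], 0)
    total = max(curr["hits_available"], 1)
    return {
        "hitsArrived": arrived,
        "hitsCompleted": completed,
        "diffHits": completed - arrived,
        "percHitsCompleted": completed / total,
        "rewardsCompleted": completed * curr["reward"],
    }

# A batch that shrank from 120 to 100 HITs between two observations:
feats = derive_features({"hits_available": 120, "reward": 0.05},
                        {"hits_available": 100, "reward": 0.05})
```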
3.5.2 Throughput Prediction
To predict the throughput of a batch at time T , we train a Random Forest Regression model
with samples taken in the range [T −δ,T ) where δ is the size of the time window that we are
considering directly prior to time T . The rationale behind this approach is that the throughput
should be directly correlated to the current and recent market situations.
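A minimal sketch of this kind of model on synthetic data follows; the functional form of the target, the value ranges, and the reduced feature set are all illustrative assumptions, not the thesis data. In practice, the training samples would be drawn from the window [T − δ, T) as described above.

```python
# Illustrative sketch: a Random Forest regressor for hourly batch throughput,
# with Gini feature importances read off the trained model.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
hit_available = rng.integers(1, 5000, n)    # current batch size
age_minutes = rng.integers(1, 10_000, n)    # minutes since the batch was posted
reward = rng.uniform(0.01, 1.0, n)          # irrelevant by construction here

# Synthetic target: larger and fresher batches complete more HITs per hour.
diff_hit = 0.02 * hit_available * np.exp(-age_minutes / 5000) + rng.normal(0, 5, n)

X = np.column_stack([hit_available, age_minutes, reward])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, diff_hit)

importances = dict(zip(["HIT_available", "Age_minutes", "reward"],
                       model.feature_importances_))
```

On this synthetic data, HIT_available and Age_minutes dominate the importances by construction, mirroring the qualitative finding reported below.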
We considered data from June to October 2014, with hourly observations (see Section 3.3.1),
from which we uniformly sampled 50 test time points for evaluation purposes. In our experi-
ments, the best prediction results, in terms of R-squared2, were obtained using δ = 4 hours.
For that window, our predicted versus actual throughput values are shown in Figure 3.10. The
figure suggests that the prediction works best for larger batches having a large momentum.
In order to understand which features contribute significantly to our prediction model, we
2 http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
Figure 3.10 – Predicted vs. actual batch throughput values for δ = 4 hours. The prediction works best for larger batches having a large momentum.
proceed by feature ablation. For this experiment, we computed the prediction evaluation score
R-squared, for 1,000 randomly sampled test time points and kept those where the prediction
worked reasonably well, i.e., having R-squared > 0, which left 327 samples. Next, we reran the
prediction on the same samples, removing one feature at a time. The results revealed that the
features HIT_available (i.e., the number of tasks in the batch) and Age_minutes (i.e., how long
ago the batch was created) were the only ones having a statistically significant impact on the
prediction score, with p < 0.05 and p < 0.01 respectively.
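The ablation loop itself is generic; in the sketch below, `score` is a toy stand-in for the real train-and-evaluate step and the weights are invented:

```python
def ablation(features, score):
    """Map each feature to the drop in score when that feature is left out."""
    baseline = score(features)
    return {f: baseline - score([g for g in features if g != f]) for f in features}

# Toy score: pretend HIT_available carries most of the predictive power.
weights = {"HIT_available": 0.6, "Age_minutes": 0.2, "reward": 0.05}
score = lambda feats: sum(weights[f] for f in feats)

drops = ablation(list(weights), score)   # drop in score per removed feature
```

In the actual experiment, `score` would retrain the Random Forest without the feature and return the R-squared over the 327 retained samples; a paired significance test over those samples then decides which drops are meaningful.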
3.5.3 Feature Importance
In order to better grasp the characteristics of the batch throughput, we examine the computed
Gini importance of the features [35]. In this experiment, we varied the training time frame
δ from 1 hour to 24 hours for each tested time point. Figure 3.11 shows the contribution of
our two top features (as concluded from the previous experiment, i.e., HIT_available and
Age_minutes) and how their importance varied as we increased the training time frame.
These features are again listed in Table 3.1; the slope indicates whether a feature gains
importance over time (positive value) or loses importance (negative value).
The most important feature is HIT_available, that is, the current size of the batch. Indeed,
as observed in previous work, larger batches tend to attract more workers [76, 64]. This
feature becomes less important when we consider longer periods, partly because of noise,
Figure 3.11 – Computed feature importance (in %) of HIT_available and Age_minutes when considering a larger training window for batch throughput prediction.
Table 3.1 – Gini importance of the top 2 features used in the prediction experiment. A large mean indicates a better overall contribution to the prediction. A positive slope indicates that the feature gains in importance when the considered time window is larger.

Feature        mean     stderr   slope    intercept
HIT_available  29.8606  13.4247  -0.0257  34.4940
Age_minutes    12.9087   8.1967  -0.0050  13.8181
and because other features start to encode additional facts. On the other hand, the importance
of Age_minutes suggests that the crowd is sensitive to newly posted HITs, i.e., to how fresh
the HITs are. To better understand this phenomenon, we analyze what attracts the workforce
to the platform in the next section.
3.6 Market Analysis
Finally, we study the demand and supply of the Amazon MTurk marketplace. In the following,
we define Demand as the number of new tasks published on the platform by the requesters.
In addition, we compute the average reward of the tasks that were posted. Conversely, we
define Supply as the workforce that the crowd provides, concretized as the number of tasks
completed by the workers in a given time window. In this section, we use hourly collected
data for the period spanning June to October 2014.
3.6.1 Supply Attracts New Workers
We start by analyzing how the market reacts when new tasks arrive on the platform, in order
to understand the degree of elasticity of the supply. If the supply of work is inelastic, the
amount of work done over time should be independent of the demand for work. So, if the
amount of tasks available in the market (“demand”) increases, then the percentage of work
that gets completed in the market should drop, as the same amount of “work done" gets split
among a higher number of tasks. To understand the elasticity of the supply, we regressed the
percentage of work done in every time period (measured as the percentage of HITs that are
completed) against the number of new HITs that are posted in that period. Figure 3.12 shows
the scatterplot for those two variables.
Our data reveals that an increase in the number of arrived HITs is positively associated with a
higher percentage of completed HITs. This result provides evidence that the new work that is
posted is more attractive than the tasks previously available in the market, and attracts “new
work supply".3
Our regression4 of the “Percent Completed" against “Hits Arrived (in thousands)" indicates an
intercept of 2.5 and a slope of 0.05. To put these numbers in context: On average, there are 300K
HITs available in the market at any given time, and on average 10K new HITs arrive every hour.
The intercept of 2.5 means that 2.5% of these 300K HITs (i.e., 7.5K per hour) get completed, as
a baseline, assuming that no new HIT gets posted. The slope is 0.05, meaning that if 10K new
HITs arrive within an hour, then the completion ratio increases by 0.5%, to 3% (i.e., 9K HITs per
hour). When 50K new HITs arrive within an hour, then the completion percentage increases
to 5% indicating that 15K to 20K HITs get completed. In other words, approximately 20% of
the new demand gets completed within an hour of being posted, indicating that new work
has almost 10x higher attractiveness for the workers than the remaining work that is available
on the platform. This result could be explained by how tasks are presented to workers by AMT.
Workers, when not searching for tasks using specific keywords, are presented with the most
recently published tasks first.
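Plugging the reported coefficients back into the fitted line reproduces the numbers above; the constants are taken from the text, and the helper below is just illustrative arithmetic.

```python
INTERCEPT, SLOPE = 2.5, 0.05   # OLS fit: percent completed vs. HITs arrived (thousands)
MARKET_SIZE = 300_000          # average HITs available at any given time

def completed_per_hour(new_hits):
    """HITs completed per hour implied by the regression, for `new_hits` arrivals."""
    pct = INTERCEPT + SLOPE * (new_hits / 1000)   # completion percentage
    return pct / 100 * MARKET_SIZE

baseline = completed_per_hour(0)        # 2.5% of 300K: 7.5K HITs/hour
with_10k = completed_per_hour(10_000)   # 3%: 9K HITs/hour
with_50k = completed_per_hour(50_000)   # 5%: 15K HITs/hour
```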
3.6.2 Demand and Supply Periodicity
On the demand side, some requesters frequently post new batches of recurrent tasks. Hence,
we are interested in the periodicity of such demand in the marketplace and the supply it drives.
To look into this, we consider both the time series of available HITs and of completed rewards.
First, we observe that the demand exhibits a strong weekly periodicity, which is reflected by
the autocorrelation that we compute from the number of available HITs on Amazon MTurk
(see Figures 3.13a and 3.13c). The market seems to have a significant memory that lasts for
3 From the data available, it is not possible to tell whether the new supply comes from distinct workers, from workers that were idle, or from an increased productivity of existing workers.
4 We use Ordinary Least Squares regression.
Figure 3.12 – The effect of newly arrived HITs on the work supplied (percent of HITs completed vs. HITs arrived, in thousands, with the OLS fit). Here, the supply is expressed as the percentage of HITs completed in the market.
approximately 7-10 days.
Conversely, and to check for the periodicity in the supply, we compute an autocorrelation on
the weekly moving average of the completed HITs reward. Figures 3.13b and 3.13d show that
there is a strong weekly periodicity effect, as we observe high values in the 0-250 hour range.
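The periodicity check boils down to computing the sample autocorrelation of an hourly series. A minimal sketch on a synthetic series with a 168-hour (weekly) cycle:

```python
import math

def autocorr(xs, lag):
    """Sample autocorrelation of series `xs` at the given lag (in hours)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[t] - mean) * (xs[t + lag] - mean) for t in range(n - lag))
    return cov / var

# Synthetic "HITs available": a weekly sinusoid over ~12 weeks of hourly data.
series = [math.sin(2 * math.pi * t / 168) for t in range(2000)]

weekly = autocorr(series, 168)   # high: the series repeats every 168 hours
offpeak = autocorr(series, 84)   # half-period lag: strongly negative
```

A real series would be noisier, but the same weekly peak near lag 168 is what Figures 3.13c and 3.13d exhibit.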
3.7 Discussion
In this section, we summarize the main findings of our study and discuss our results. We
extracted several trends from the five years of data, summarized as follows:
• Tasks related to audio transcription have been gaining momentum in recent years and
are today the most popular tasks on AMT.
• The popularity of Content Access HITs has decreased over time, while surveys have
become increasingly popular, especially in the US.
• While most HITs do not require country-specific workers, among those that do, most
require US-based workers.
• HITs that are exclusively asking for workers based in India have strongly decreased over
time.
• Surveys are the most popular type of HITs for US-based workers.
• The most frequent HIT reward value has increased over time, reaching $0.05 in 2014.
• New requesters constantly join AMT, making the total number of active requesters and
the available reward increase over time.
• The average HIT batch size has been stable over time; however, very large batches have
recently started to appear on the platform.
Figure 3.13 – Computed autocorrelation on the number of HITs available and on the weekly moving average of the completed reward (N.B., the autocorrelation lag is computed in hours). In both cases, we clearly see a weekly periodicity (0-250 hours). (a) Number of HITs available (in thousands) over a three-month period. (b) Weekly moving average of rewards completed (in thousand dollars) over a three-month period. (c) Autocorrelation on the HITs available. (d) Autocorrelation on the moving average of rewards completed.
Our batch throughput prediction (Section 3.5) indicates that the throughput of batches can
best be predicted based on the number of HITs available in the batch (i.e., its size) and on its
freshness (i.e., how long the batch has been on the platform).
Finally, we analyzed AMT as a marketplace in terms of demand (new HITs arriving) and supply
(HITs completed). We observed some strong weekly periodicity both in demand and in supply.
We can hypothesize that many requesters might have repetitive business needs following
weekly trends, while many workers work on AMT on a regular basis during the week.
3.8 Conclusions
We studied data collected from a popular micro-task crowdsourcing platform, AMT, and ana-
lyzed a number of key dimensions of the platform, including: topic, task type, reward evolu-
tion, platform throughput, and supply and demand. The results of our analysis can serve as a
starting point for improving existing crowdsourcing platforms and for optimizing the overall
efficiency and effectiveness of human computation systems. The evidence presented above
indicates how requesters should use crowdsourcing platforms to get the best out of them:
by engaging with workers and publishing large volumes of HITs at specific points in time.
Future research based on this work might look at different directions. On one hand, novel
micro-task crowdsourcing platforms need to be designed based on the findings identified in
this work, such as the need for supporting specific task types like audio transcription or surveys.
Additionally, analyses focusing on specific subsets of the data could provide a deeper
understanding of the micro-task crowdsourcing universe. Examples include per-requester or
per-task analyses of publishing behavior rather than looking at the evolution of the entire
market as we did in this work. Similarly, a worker-centered analysis could provide additional evidence of the existence
of different classes of workers, e.g., full-time vs casual workers, or workers specializing on
specific task types as compared to generalists who are willing to complete any available task.
In the next chapter, we will start by investigating a technique to aggregate the results of a HIT
(run with multiple repetitions) in order to improve the quality of the end results.
4 Human Intelligence Task Quality Assurance
4.1 Introduction
One of the most significant benefits of crowdsourcing is the ability to tap into human computation
at scale. Hundreds, or even thousands, of crowd workers can participate in a crowdsourcing
campaign, thus contributing to its quick completion. The only caveat is that the
collected answers are not verified one by one (as doing so would defeat the purpose of
crowdsourcing) and are usually subject to high error rates. In fact, some workers might be
malicious and try to complete the tasks quickly by providing random answers in order to
collect the rewards with the
least effort. One strategy that can be used to avoid high error rates is to use test questions to
stop poorly performing workers. However, human error is not always a sign of maliciousness;
it can simply be due to fatigue, a defect in the system, bias, over-confidence, or any other
temporary factor. Even the most honest workers cannot consistently perform at 100% all the
time; hence, stopping workers can be considered an extreme measure.
Another compatible method is to use repetitions, i.e., ask multiple workers for the same
task and then automatically decide which answer to pick based on some form of agreement
scheme. Majority vote is the simplest approach; it consists of selecting the answer that
most workers chose. However, 1) majority vote can easily be cheated, e.g., by multiple
malicious workers agreeing on an answer, and 2) it gives all workers the same weight,
regardless of whether we have any prior knowledge about their reliability.
In this chapter, we investigate a Bayesian framework to dynamically assess the results of
tasks with multiple-choice questions obtained from arbitrary human workers operating on a
crowdsourcing platform. We show that we can effectively combine workers' answers by taking
into account an adaptive weight associated with each worker, in addition to any available prior
output of an algorithmic pre-processing step. In the following, we focus on two use-cases,
namely Entity Linking and Instance Matching (see Section 4.1.1 for an overview), for which
we also develop and describe a hybrid human-machine system.
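As a first intuition for the framework, consider a deliberately simplified stand-in (not the probabilistic network developed later in this chapter): combining conflicting votes on a two-choice question, weighting each worker by an estimated reliability and starting from an algorithmic prior.

```python
def combine(prior_a, votes):
    """`votes` is a list of (choice, reliability) pairs; choice is 'A' or 'B'.
    Returns the posterior probability that 'A' is the correct answer,
    assuming a worker with reliability r answers correctly with probability r."""
    p_a, p_b = prior_a, 1.0 - prior_a
    for choice, rel in votes:
        p_a *= rel if choice == "A" else (1.0 - rel)
        p_b *= rel if choice == "B" else (1.0 - rel)
    return p_a / (p_a + p_b)

# Two reliable workers outweigh one unreliable dissenter despite a weak prior:
post = combine(0.4, [("A", 0.9), ("A", 0.8), ("B", 0.55)])
```

Unlike plain majority vote, the outcome here depends on both the per-worker weights and the algorithmic prior, which is the behavior the chapter's framework generalizes.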
4.1.1 The Entity Linking and Instance Matching Use-Cases
Semi-structured data is becoming more prominent on the Web, as more and more data is
either interwoven with or serialized in HTML pages. The Linked Open Data (LOD) community1,
for instance, is bringing structured data to the Web by publishing datasets using the RDF
formalism and by interlinking pieces of data coming from heterogeneous sources. As the
LOD movement gains momentum, linking traditional Web content to the LOD cloud is giving
rise to new possibilities for online information processing. For instance, identifying unique
real-world objects, persons, or concepts, in textual content and linking them to their LOD
counterparts (also referred to as Entities), opens the door to automated text enrichment (e.g.,
by providing additional information coming from the LOD cloud on entities appearing in the
HTML text), as well as streamlined information retrieval and integration (e.g., by using links to
retrieve all text articles related to a given concept from the LOD cloud).
As more LOD datasets are being published on the Web, unique entities are getting described
multiple times by different sources. It is therefore critical that such openly available datasets
are interlinked with each other in order to promote global data interoperability. The interlinking of
datasets describing similar entities enables Web developers to cope with the rapid growth of
LOD data, by focusing on a small set of well-known datasets (such as DBPedia2 or Freebase3)
and by automatically following links from those datasets to retrieve additional information
whenever necessary.
Automating the process of instance matching (IM) across heterogeneous LOD datasets and
the process of linking entities (EL) appearing in HTML pages to their correct LOD counterparts
are currently drawing a lot of attention (see the Related Work section below). These processes
are, however, highly challenging, as instance matching is known to be extremely
difficult even in relatively simple contexts. Some of the challenges that arise in this context are
1) to identify entities appearing in natural text, 2) to cope with the large-scale and distributed
nature of LOD, 3) to disambiguate candidate concepts, and 4) to match instances across
datasets.
The current matching techniques used to relate an entity extracted from text to corresponding
entities from the LOD cloud as well as those used to identify duplicate entities across datasets
can be broadly classified into two groups:
Algorithmic Matching: Given the scale of the problem (which could potentially span the entire
HTML Web), many efforts are currently focusing on designing and deploying scalable
algorithms to perform the matching automatically on very large corpora.
Manual Matching: While algorithmic matching techniques are constantly improving, they
are, at this stage, still not as reliable as humans. Hence, many organizations are still today
1 http://linkeddata.org/
2 http://dbpedia.org
3 http://freebase.org
appointing individuals to manually link textual elements to concepts. For instance, the
New York Times employs a whole team whose sole responsibility is to manually create
links from news articles to NYT identifiers4.
ZenCrowd is a system that we have developed in order to create links across large datasets
containing similar instances and to semi-automatically identify LOD entities from textual
content. Our system gracefully combines algorithmic and manual integration, by first taking
advantage of automated data integration techniques, and then by improving the automatic
results by involving human workers.
The ZenCrowd approach addresses the scalability issues of data integration by proposing a
novel three-stage blocking technique that incrementally combines three very different
approaches. In a first step, we use an inverted index built over the entire dataset to
efficiently determine potential candidates and to obtain a first ranked list of potential results.
Top potential candidates are then analyzed further by taking advantage of a more accurate
(but also more costly) graph-based instance matching technique (a similar structured/unstructured
hybrid approach has been taken in [151]). Finally, results yielding low confidence
values (as determined by probabilistic inference) are used to dynamically create micro-tasks
published on a crowdsourcing platform, the assumption being that the tasks in question do
not require special expertise to be performed.
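The control flow of the three stages can be sketched as follows; the candidate scorers, thresholds, and abbreviated identifiers below are all placeholders for illustration, not ZenCrowd's actual components:

```python
def three_stage(pairs, index_score, graph_score, ask_crowd,
                keep_top=10, confident=0.8):
    """Schematic three-stage blocking: cheap filter, costly scorer, crowd fallback."""
    matches = []
    # Stage 1: cheap inverted-index ranking keeps only the top candidates.
    candidates = sorted(pairs, key=index_score, reverse=True)[:keep_top]
    for pair in candidates:
        # Stage 2: more accurate (and more costly) graph-based confidence.
        conf = graph_score(pair)
        # Stage 3: confident matches are accepted; the rest go to the crowd.
        if conf >= confident or ask_crowd(pair):
            matches.append(pair)
    return matches

accepted = three_stage(
    [("dbp:Tom_Cruise", "fb:m.07r1h"), ("dbp:Paris", "fb:m.05qtj")],
    index_score=lambda p: 1.0,
    graph_score=lambda p: 0.95 if "Tom_Cruise" in p[0] else 0.3,
    ask_crowd=lambda p: False,
)
```

The point of the staging is economic: the expensive graph scorer only sees index survivors, and paid crowd workers only see the pairs that remain uncertain after both automatic stages.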
ZenCrowd does not focus on the algorithmic problems of instance matching and entity linking
per se. However, we make a number of key contributions at the interface of algorithmic and
manual data integration, and discuss in detail how to most effectively and efficiently combine
scalable inverted indices, structured graph queries and human computation in order to match
large LOD datasets. The contributions described in this chapter include:
• a new system architecture supporting algorithmic and manual instance matching as
well as entity linking in concert.
• a new three-stage blocking approach that combines highly scalable automatic filtering
of semi-structured data together with more complex graph-based matching and high-
quality manual matching performed by the crowd.
• a new probabilistic inference framework to dynamically assess the results of arbitrary
human workers operating on a crowdsourcing platform, and to effectively combine
their (potentially conflicting) answers taking into account the results of the automatic
stage output.
• an empirical evaluation of our system in a real deployment over different Human In-
telligence Task interfaces showing that ZenCrowd combines the best of both worlds,
in the sense that our combined approach turns out to be more effective than both (a)
purely algorithmic matching, by improving accuracy, and (b) fully manual matching, by being
cost-effective while mitigating the workers' uncertainty.
4 See http://data.nytimes.com/
The rest of this chapter is structured as follows: Section 4.2 introduces the terminology used
throughout the chapter. Section 4.3 gives an overview of the architecture of our system, includ-
ing its algorithmic matching interface, its probabilistic inference engine, and its templating
and crowdsourcing components. Section 4.4 presents our graph-based matching confidence
measure as well as different methods to crowdsource instance matching and entity linking
tasks. We describe our formal model to combine both algorithmic and crowdsourcing results
using Probabilistic Networks in Section 4.5. We introduce our evaluation methodology and
discuss results from a real deployment of our system for the Instance Matching task in Sec-
tion 4.6 and for the Entity Linking task in Section 4.7. We review the state of the art in instance
matching and entity linking in Section 4.8, before concluding in Section 4.9.
4.2 Preliminaries on the EL and IM Tasks
As already mentioned, ZenCrowd addresses two distinct data integration tasks related to the
general problem of Entity Resolution [66].
We define Instance Matching as the task of identifying two instances following different
schemas (or ontologies) but referring to the same real-world object. Within the database
literature, this task is related to Record Linkage [39], Duplicate Detection [27], or Entity Identi-
fication [107] when performed over two relational databases. However, in our setting, the main
goal is to create new cross-dataset <owl:sameAs> RDF statements. As commonly assumed for
Record Linkage, we also assume that there are no duplicate entities within the same source and
leverage this assumption when computing the final probability of a match in our probabilistic
reasoning step.
We define Entity Linking as the task of assigning a URI selected from a background knowledge
base for an entity mentioned in a textual document. This task is also known as Entity Resolu-
tion [66] or Disambiguation [36] in the literature. In addition to the classic entity resolution
task, the objective of our task is not only to understand which possible interpretation of the
entity is correct (Michael Jordan the basketball player as compared to the UC Berkeley pro-
fessor), but also to assign a URI to the entity, which can be used to retrieve additional factual
information about it.
Given two LOD datasets U1 = {u11, ..., u1n} and U2 = {u21, ..., u2m} containing structured entity
descriptions uij, where i identifies the dataset and j the entity URI, we define Instance
Matching as the identification of each pair (u1i, u2j) of entity URIs from U1 and U2 referring to
the same real-world entity; we call such a pair a match. An example of a match is given by the
pair u11 = <http://dbpedia.org/resource/Tom_Cruise> and u21 = <http://www.freebase.com/m/07r1h>,
where U1 is the DBPedia LOD dataset and U2 is the Freebase LOD dataset.
Given a document d and a LOD dataset U1 = {u11, ..., u1n}, we define Entity Linking as the task
of identifying all entities from U1 that appear in d and of associating the corresponding
identifier u1i with each entity.
These two tasks are highly related: Instance Matching aims at creating connections between
different LOD datasets that describe the same real-world entity using different vocabularies.
Such connections can then be used to run Entity Linking on textual documents. Indeed, ZenCrowd
uses existing <owl:sameAs> statements as probabilistic priors to take a final decision about
which links to select for an entity appearing in a textual document.
Hence, we use in the following the term entity to refer to a real-world object mentioned in a
textual document (e.g., a news article) while we use the term instance to refer to its structured
description (e.g., a set of RDF triples), which follows the well-defined schema of a LOD dataset.
Our system relies on LOD datasets for both tasks. Such Linked Datasets describe intercon-
nected entities that are commonly mentioned in Web content. As compared to traditional
data integration tasks, the use of LOD data may support integration algorithms by means of
its structured entity descriptions and entity interlinking within and across datasets.
In our work, we make use of human intelligence at scale, first, to improve the quality of
such links across datasets and, second, to connect unstructured documents to the structured
representation of the entities they mention. To improve the results for both tasks, we selectively
use paid micro-task crowdsourcing. To do this, we create HITs on a crowdsourcing platform.
For the Entity Linking task, a HIT consists of asking which of the candidate links is correct for
an entity extracted from a document. For the Instance Matching task, a HIT consists of finding
which instance from a target dataset corresponds to a given instance from a source dataset.
See Figures 4.2, 4.3, and 4.4, which give examples of such tasks.
Paid crowdsourcing offers significant advantages for high-quality data processing. Its potential
disadvantages, however, include high financial cost, low worker availability, and workers with
poor skills or questionable honesty. To overcome these shortcomings, we reduce the financial
cost using an efficient decision engine that selectively picks tasks with a high improvement
potential. Our working assumption is that entities extracted from HTML news articles can
be recognized by the general public, especially when provided with sufficient contextual
information. Furthermore, each task is shown to multiple workers to balance out low-quality
answers.
4.3 ZenCrowd Architecture
ZenCrowd is a hybrid human-machine architecture that takes advantage of both algorithmic
and manual data integration techniques simultaneously. Figure 4.1 presents a simplified
architecture of our system. We start by giving an overview of our system below in Section 4.3.1,
and then describe in more detail some of its components in Sections 4.3.2 to 4.3.4.
Chapter 4. Human Intelligence Task Quality Assurance
Figure 4.1 – The architecture of ZenCrowd: For the Instance Matching task (green pipeline), the system takes as input a pair of datasets to be interlinked and creates new links between the datasets using <owl:sameAs> RDF triples. ZenCrowd uses a three-stage blocking procedure that combines both algorithmic matchers and human workers in order to generate high quality results. For the Entity Linking task (orange pipeline), our system takes as input a collection of HTML pages and enriches them by extracting textual entities appearing in the pages and linking them to the Linked Open Data cloud.
4.3.1 System Overview
In the following, we describe the different components of the ZenCrowd system, focusing first
on the Instance Matching and then on the Entity Linking pipeline.
Instance Matching Pipeline
In order to create new links, ZenCrowd takes as input a pair of datasets from the LOD cloud.
Among the two datasets, one is selected as the source dataset and one as the target dataset.
Then, for each instance of the source dataset, our system tries to come up with candidate
matches from the target dataset.
First, the label used to name the source instance is used to query the LOD Index (see Sec-
tion 4.3.2) in order to obtain a ranked list of candidate matches from the target dataset. This
can efficiently and cheaply filter out clear non-matches from the potentially numerous (in the
order of hundreds of millions for some LOD datasets) available instances. Next,
top-ranked candidate instances are further examined in the graph database. This step is taken
to obtain more complete information about the target instances, both to compute a more
accurate matching score as well as to provide information to the Micro-Task Manager (see
Figure 4.1), which has to fill the HIT templates for the crowd (see Section 4.3.5, which describes
our three-stage blocking methodology in more detail). At this point, the candidate matches
that have a low confidence score are sent to the crowd for further analysis. The Decision Engine
collects confidence scores from the previous steps in order to decide what to crowdsource,
together with data from the graph database to construct the HITs.
Finally, we gather the results provided by the crowd into the Probabilistic Network component,
which combines them to come up with a final matching decision. The generated matchings
are then given as output by ZenCrowd in the form of RDF <owl:sameAs> links that can be
added back to the LOD cloud.
Entity Linking Pipeline
The other task ZenCrowd performs is Entity Linking, that is, identifying occurrences of LOD
entities in textual content and creating links from the text to corresponding instances stored
in a database. ZenCrowd takes as input sets of HTML pages (that can for example be provided
by a Web crawler). The HTML pages are then passed to Entity Extractors that inspect the
pages and identify potentially relevant textual entities (e.g., persons, companies, places, etc.)
mentioned in the page. Once detected, the entities are fed into Algorithmic Linkers that
attempt to automatically link the textual entities to semantically similar instances from the
LOD cloud. As querying the Web of data dynamically to link each entity would incur a very
high latency, we build a local cache (called LOD Index in Figure 4.1) to locally retrieve and
index relevant information from the LOD cloud. Algorithmic linkers return lists of top-k links
to LOD entities, along with a confidence value for each potentially relevant link.
The results of the algorithmic linkers are stored in a Probabilistic Network, and are then
combined and analyzed using probabilistic inference techniques. ZenCrowd treats the results
of the algorithmic linkers in three different ways depending on their quality. If the algorithmic
results are deemed excellent by our Decision Engine, the results (i.e., the links connecting a
textual entity extracted from an HTML page to the LOD cloud) get stored in a local database
directly. If the results are deemed useless (e.g., when all the links picked by the linkers have a
low confidence value), the results get discarded. Finally, if the results are deemed promising but
uncertain (for example because several algorithmic linkers disagree on the links, or because
their confidence values are relatively low), they are then passed to the Micro-Task Manager,
which extracts relevant snippets from the original HTML pages, collects all promising links,
and dynamically creates a micro-task using a templating engine. An example of micro-task for
the entity linking pipeline is shown in Figure 4.4. Once created, the micro-task is published
on a crowdsourcing platform, where it is handled by the crowd workers. When the human
workers have performed their task (i.e., when they have picked the relevant links for a given
textual entity), the workers' results are fed back to the Probabilistic Network. When all the links
are available for a given HTML page, an enriched HTML page—containing both the original
HTML code as well as RDFa annotations linking the textual entities to their counterpart from
the LOD cloud—is finally generated.
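The three-way treatment of algorithmic linking results can be sketched as follows. This is a minimal illustration: the function name `route_links` and the threshold values are our own assumptions, not ZenCrowd's actual implementation.

```python
# Illustrative sketch of the Decision Engine's three-way routing for
# Entity Linking results. The thresholds (0.9 / 0.1) and the function
# name are assumptions for illustration only.

def route_links(candidate_links, high=0.9, low=0.1):
    """candidate_links: list of (link_uri, confidence) pairs for one entity.

    Returns a (decision, links) tuple, where decision is one of
    'store', 'discard', or 'crowdsource'."""
    best = max(conf for _, conf in candidate_links)
    if best >= high:
        # Excellent result: store the top-confidence link directly.
        return "store", [max(candidate_links, key=lambda lc: lc[1])]
    if best < low:
        # All links have low confidence: discard the entity.
        return "discard", []
    # Promising but uncertain: send all plausible links to the crowd.
    return "crowdsource", [(l, c) for l, c in candidate_links if c >= low]
```

An entity whose best link has, say, confidence 0.55 would thus be routed to the Micro-Task Manager rather than stored or discarded.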
4.3.2 LOD Index and Graph Database
The LOD index engine is an information retrieval engine that we built on top of the LOD
dataset to speed up the entity retrieval process. While most LOD datasets provide a public
SPARQL interface, they are in practice cumbersome to use due to the high latency
(from several hundred milliseconds to several seconds) and bandwidth consumption they
impose. Instead of querying the LOD cloud dynamically for each new instance to be matched,
ZenCrowd caches locally pertinent information from the LOD cloud. Our LOD Index engine
receives as input a list of SPARQL endpoints or LOD dumps as well as a list of triple patterns,
and iteratively retrieves all corresponding triples from the LOD datasets. Using multiple LOD
datasets improves the coverage of our system, since some datasets cover only geographical
locations, while other datasets cover the scientific domain or general knowledge. The infor-
mation thus extracted is cached locally in two ways: in a local graph query engine—offering
a SPARQL interface—and in an inverted index to provide efficient support for unstructured
queries.
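As a rough sketch of the unstructured half of this cache, an inverted index over entity labels can rank candidate instances by token overlap with a query label. All names here are hypothetical; a production engine would add tokenization, normalization, and IDF weighting.

```python
from collections import defaultdict

class LabelIndex:
    """Toy inverted index from label tokens to instance URIs."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of URIs
        self.labels = {}                  # URI -> label

    def add(self, uri, label):
        self.labels[uri] = label
        for token in label.lower().split():
            self.postings[token].add(uri)

    def query(self, label, k=10):
        # Score each candidate by the number of query tokens it shares.
        scores = defaultdict(int)
        for token in label.lower().split():
            for uri in self.postings.get(token, ()):
                scores[uri] += 1
        return sorted(scores.items(), key=lambda item: -item[1])[:k]
```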
After ranked results are obtained from the LOD index, a more in-depth analysis of the candidate
matches is performed by means of queries to a graph database. This component stores and
indexes data from the LOD datasets and accepts SPARQL queries to retrieve predicate value
pairs attached to the query node. This component is used both to define the confidence
scoring function by means of schema-matching results (Section 4.4.1) as well as to compute
confidence scores for candidate matches and to show matching evidence to the crowd (Section
4.4.2).
4.3.3 Probabilistic Graph & Decision Engine
Instead of using heuristics or arbitrary rules, ZenCrowd systematizes the use of Probabilistic
Networks to make sensible decisions about potential instance matches and entity links.
All the evidence gathered from both the algorithmic methods and the crowd is fed into a
Probabilistic Network and used by our decision engine to process all entities accordingly. Our
probabilistic models are described in detail in Section 4.5.
4.3.4 Extractors, Algorithmic Linkers & Algorithmic Matchers
The Extractors and Algorithmic Linkers are used exclusively by the Entity Linking pipeline (see
Figure 4.1). The Entity Extractors receive HTML as input, and extract named entities appearing
in the HTML content as output. Entity Extraction is an active area of research and a number of
advances have recently been made in that field (using for instance third-party information or
novel NLP techniques). Entity extraction is not the focus of our work in ZenCrowd. However,
we support arbitrary entity extractors through a generic interface in our system and take the
union of their respective outputs to obtain additional results.
Once extracted, the textual entities are inspected by algorithmic linkers, whose role is to
find semantically related entities from the LOD cloud. ZenCrowd implements a number of
state of the art linking techniques (see Section 4.7 for more details) that take advantage of
the LOD Index component to efficiently find potential matches. Each matcher also imple-
ments a normalized scoring scheme, whose results are combined by our Decision Engine (see
Section 4.5).
4.3.5 Three-Stage Blocking for Crowdsourcing Optimization
For the Instance Matching pipeline, a naive implementation of an Algorithmic Matcher would
check each pair of instances from two input datasets. However, the problem of having to
deal with too many candidate pairs rapidly surfaces. Moreover, crowdsourcing all possible
candidate pairs is unrealistic: For example, matching two datasets containing just 1’000
instances each would cost $150’000 if we crowdsource 1’000’000 possible pairs to 3 workers
paying $0.05 per task. Instead, we propose a three-stage blocking approach.
A common way to deal with the quadratic number of potential comparisons is blocking (see
Section 4.8). Basically, blocking groups promising candidate pairs together in sets using a
computationally inexpensive method (e.g., clustering) and, as a second step, performs all
possible comparisons within such sets using a more expensive method (e.g., string similarity).
ZenCrowd uses a three-stage blocking approach that involves crowdsourcing as an additional
step in the blocking process (see the three stages in Figure 4.1). Crowdsourcing the instance
matching process is expensive both in terms of latency as well as financially. For this reason,
only a very limited set of candidate pairs should be crowdsourced when matching large
datasets.
Given a source instance from a dataset, ZenCrowd considers all instances of the target dataset
as possible matches. The first blocking step is performed by means of an inverted index
over the labels of all instances in the target dataset. This allows the system to produce, very
efficiently (i.e., in the order of milliseconds), a list of instances ranked by a scoring function
that measures the likelihood of matching the source instance.
As a second step, ZenCrowd computes a more accurate but also more computationally ex-
pensive matching confidence for the top-ranked instances generated by the first step. This
confidence value is computed based on schema matching results between the two datasets and
produces a score in [0,1]. This value is not computed on all instances of the target dataset
but rather for those that are likely to be a good match as given by the first blocking step (see
Section 4.4.1).
This hybrid approach exploiting the interdependence of unstructured indices as well as
structured queries against a graph database is similar to the approach taken in [151] where,
for the task of Ad-Hoc Object Retrieval, a ranked list of results is improved by means of an
analysis of the result vicinity in the graph.
The final step consists of asking the crowd about candidate matching pairs. Based on the
confidence score computed during the previous step, ZenCrowd takes a decision about which
HITs to create on the crowdsourcing platform. As the goal of the confidence score is to
indicate how likely it is that a pair is a correct match, the system selects those cases where
the confidence is not already high enough so that it can be further improved by asking the
crowd. Possible instantiations of this step may include the provision of a fixed budget for the
crowdsourcing platform, which the system is allowed to spend in order to optimize the quality
of the results. Generally speaking, the system produces a ranked list of candidate pairs to be
crowdsourced based on the confidence score. Then, given the available resources, top pairs
are crowdsourced in batches to improve the accuracy of the matching process. The task
completion time, on the other hand, can be shortened by increasing the reward assigned
to workers.
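A minimal sketch of this budgeted selection step might look as follows. The function name and the fixed-budget policy are illustrative assumptions; costs are expressed in cents to avoid floating-point rounding.

```python
# Hypothetical sketch of the third blocking stage: rank uncertain pairs
# by confidence and crowdsource the top ones until the budget runs out.

def select_pairs_to_crowdsource(candidates, budget_cents, cost_per_hit_cents,
                                workers_per_hit=3, high_conf=0.9):
    """candidates: list of (pair, confidence) from the second blocking step.

    Pairs whose confidence already exceeds high_conf are accepted
    automatically; the rest are ranked by confidence and crowdsourced
    top-down while the budget allows."""
    auto_accepted = [p for p, c in candidates if c >= high_conf]
    uncertain = sorted((pc for pc in candidates if pc[1] < high_conf),
                       key=lambda pc: -pc[1])
    affordable = budget_cents // (cost_per_hit_cents * workers_per_hit)
    to_crowdsource = [p for p, _ in uncertain[:affordable]]
    return auto_accepted, to_crowdsource
```

With the cost model used in the example above ($0.05 per task, 3 workers per pair), a $0.30 budget covers exactly two candidate pairs.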
4.3.6 Micro-Task Manager
The micro-task manager is responsible for dynamically creating human computation tasks
that are then published on a crowdsourcing platform. Whenever a match is deemed promising
by our Decision Engine (see below for details), it is sent to the crowd for further examination.
The micro-task manager dynamically builds a Web page to be published on the crowdsourcing
platform using three resources: i) the name of the source instance; ii) some contextual infor-
mation generated by querying the graph database; and iii) the current top-k matches for the
instance from the blocking process. Once created and published, the matching micro-tasks
can be selected by workers on the crowdsourcing platform, who are then asked to select the
relevant matches (if any) for the source instance, given its name, the contextual information
from the graph database, and the various candidate matches described as in the LOD cloud.
Once performed, the results of the micro-matching tasks are sent back to the Micro-Task
Manager, which inserts them in the Probabilistic Network.
4.4 Effective Instance Matching based on Confidence Estimation and
Crowdsourcing
In this section, we describe the final steps of the blocking process that assure high quality
instance matching results. We first define our schema-based matching confidence measure,
which is then used to decide which candidate matchings to crowdsource. Then, we present
different approaches to crowdsourcing instance matching tasks. Specifically we compare two
different HIT designs where different context information about the instances is presented to
the worker.
Organization
DBPedia                                          Freebase
http://www.w3.org/2000/01/rdf-schema#label       http://rdf.freebase.com/ns/type.object.name
http://dbpedia.org/property/established          http://rdf.freebase.com/ns/education.educational_institution.founded
http://dbpedia.org/property/foundation           http://rdf.freebase.com/ns/business.company.founded
http://dbpedia.org/property/companyName          http://rdf.freebase.com/ns/type.object.name
http://dbpedia.org/property/founded              http://rdf.freebase.com/ns/sports.sports_team.founded

Person
DBPedia                                          Freebase
http://www.w3.org/2000/01/rdf-schema#label       http://rdf.freebase.com/ns/type.object.name
http://dbpedia.org/ontology/birthdate            http://rdf.freebase.com/ns/people.person.date_of_birth
http://dbpedia.org/property/name                 http://rdf.freebase.com/ns/type.object.name
http://dbpedia.org/property/dateOfBirth          http://rdf.freebase.com/ns/people.person.date_of_birth
http://dbpedia.org/property/dateOfDeath          http://rdf.freebase.com/ns/people.deceased_person.date_of_death
http://dbpedia.org/property/birthname            http://rdf.freebase.com/ns/common.topic.alias

Location
DBPedia                                          Freebase
http://www.w3.org/2000/01/rdf-schema#label       http://rdf.freebase.com/ns/type.object.name
http://dbpedia.org/property/establishedDate      http://rdf.freebase.com/ns/location.dated_location.date_founded
http://dbpedia.org/ontology/demonym              http://rdf.freebase.com/ns/freebase.linguistic_hint.adjectival_form
http://dbpedia.org/property/name                 http://rdf.freebase.com/ns/type.object.name
http://dbpedia.org/property/isocode              http://rdf.freebase.com/ns/location.administrative_division.iso_3166_2_code
http://dbpedia.org/property/areaTotalKm          http://rdf.freebase.com/ns/location.location.area

Table 4.1 – Top ranked schema element pairs in DBPedia and Freebase for the Person, Location, and Organization instances.
4.4.1 Instance-Based Schema Matching
While using the crowd to match instances across two datasets typically results in high quality
matchings, it is often infeasible to crowdsource all potential matches because of the very
high financial cost associated. Thus, as a second filtering step, we define a new measure that
computes the confidence of a matching as generated by the initial inverted index blocking
step.
Formally, given a candidate matching pair (i 1, i 2) we define a function f (i 1, i 2) that creates a
ranked list of candidate pairs such that the pairs ranked at the top are the most likely to be
correct. In such a way, it is possible to selectively crowdsource candidate matchings with lower
confidence to improve matching precision with a limited cost.
The matching confidence measure used by ZenCrowd is based on schema matching informa-
tion. The first step in the definition of the confidence measure consists in using a training set
of matchings among the two datasets5. Given a training pair (t1, t2) we retrieve all predicates
and values for the instances t1 and t2 and perform an exact string match comparison of their
values. At the end of such process, we rank predicate pairs by the number of times an exact
match on their values has occurred. Table 4.1 gives the top ranked predicate pairs for the
DBPedia and Freebase datasets. We observe that this simple instance-based schema mapping
technique yields excellent results for many LOD schemas. For instance, for the entity type
5 In our experiments we use 100 ground-truth matchings that are discarded later when evaluating the proposed matching approaches.
person in Table 4.1, ‘birthdate’ from DBPedia is correctly matched to ‘date_of_birth’
from Freebase.
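The counting procedure behind Table 4.1 can be sketched as follows, assuming a helper `get_triples(uri)` that returns the (predicate, value) pairs of an instance; both names are hypothetical.

```python
from collections import Counter

def rank_predicate_pairs(training_pairs, get_triples):
    """training_pairs: list of (t1, t2) ground-truth instance matches.
    get_triples(uri) -> iterable of (predicate, value) pairs.

    Counts, for every (source predicate, target predicate) pair, how often
    their values match exactly across the training set, and returns the
    pairs ranked by that count."""
    counts = Counter()
    for t1, t2 in training_pairs:
        for p1, v1 in get_triples(t1):
            for p2, v2 in get_triples(t2):
                if v1 == v2:
                    counts[(p1, p2)] += 1
    return counts.most_common()
```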
After the list of schema elements has been matched across the two datasets, we define the
confidence measure for an individual candidate matching pair. To obtain a confidence score
in [0,1] we compute the average Jaccard similarity among all tokenized values of all matched
schema elements for the two candidate instances u1 and u2. When a list of values is
assigned to a schema element (e.g., a DBPedia instance may have multiple labels that represent
the instance name in different languages) we retain the maximum Jaccard similarity value in
the list for that schema element. For example, the confidence score of the following matching
pair will be ((2/3) + 1)/2 = 0.83.

u1                                    u2
rdfs:label        barack h. obama     fb:name             barack obama
dbp:dateOfBirth   08-04-61            fb:date_of_birth    08-04-61
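The confidence computation can be sketched as follows; the dictionary-based representation of the matched schema elements is an assumption made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between the token sets of two string values."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def matching_confidence(values1, values2):
    """values1/values2 map each matched schema element to the list of
    values the two candidate instances carry for it. The confidence is
    the average, over schema elements, of the maximum Jaccard similarity
    between any pair of values (handling multi-valued elements)."""
    scores = []
    for element in values1:
        best = max(jaccard(v1, v2)
                   for v1 in values1[element] for v2 in values2[element])
        scores.append(best)
    return sum(scores) / len(scores)
```

On the example above, the labels give 2/3 and the identical dates give 1, so the confidence is ((2/3) + 1)/2 ≈ 0.83.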
4.4.2 Instance Matching with the Crowd
We now turn to the description of two HIT designs we experimented with for crowdsourc-
ing instance matching in ZenCrowd. Previous work also compared different interfaces for
crowdsourcing instance matching tasks [164]. Specifically, the authors compared pairwise and
table-based matching interfaces. Instead, we compare matching interfaces based on different
pieces of information given to the worker directly on the HIT page.
Figures 4.2 and 4.3 show our two different interfaces for the instance matching task. The label-
only matching interface asks the crowd to find a target entity among the proposed matches.
In this case, the target entity is presented as its label with a link to the corresponding LOD
webpage. Then, the top ranked instances from the DBPedia dataset, which are candidates to
match the target entity, are shown. This interface is reminiscent of the automatic approach
based on the inverted index that performs the initial blocking step, though at a smaller scale
(i.e., only a few candidates are shown to the worker in this case).
The molecule interface also asks the worker to identify the target entity (from Freebase in the
figure) in the table containing top-ranked entities from DBPedia. This second interface defines
a simpler task for the worker by presenting directly on the HIT page relevant information
about the target entity as well as about the candidate matches. In this second version of the
interface, the worker is asked to directly match the instance on the left with the corresponding
instance on the right. Compared to the first matching interface, the molecule interface does
not just display the labels but also additional information (property and value pairs) about
each instance. Such information is retrieved from the graph database and displayed to the
worker.
In both interfaces, the worker can select the “No match” option if no instance matches the
Figure 4.2 – The Label-only instance matching HIT interface, where entities are displayed as textual labels linking to the full entity descriptions in the LOD cloud.
target entity. An additional field is available for the worker to leave comments.
Manual inspection of crowdsourcing results has shown that most of the errors on the many-to-
many matching interface were due to the fact that the workers did not match the target entity
but, rather, correctly matched a different entity. For example, when the NYT target entity
was a city, many workers instead selected from both Freebase and DBPedia tables an instance
about the music festival hosted in that city. Therefore, while the matching between the two
tables is correct as the same instance was identified in both candidate sets, this was not the
target instance the task was asking to match.
4.5 Probabilistic Models
ZenCrowd exploits probabilistic models to make sensible decisions about candidate results.
We describe below the probabilistic models used to systematically represent and combine
information in ZenCrowd, and how those models are implemented and handled by our system.
In the following we use factor-graphs to graphically represent probabilistic variables and
distributions. Note that our approach is not bound to this representation—we could use series
of conditional probabilities only or other probabilistic graphical models—but we decided to
use factor-graphs for their illustrative merits. For an in-depth coverage on factor graphs, we
refer the interested reader to one of the many overviews on this domain, such as [95], or to our
brief introduction made in [50].
Figure 4.3 – The Molecule instance matching HIT interface, where the labels of the entities as well as related property-value pairs are displayed.
4.5.1 Graph Models
We start by describing the probabilistic graphs used to combine all the matching evidence
gathered for a given candidate URI. Consider an instance from the source dataset. The
candidate matches are stored as a list of potential matchings m j from a LOD dataset. Each m j
has a prior probability distribution pm j computed from the confidence matching function.
Each candidate can also be examined by human workers wi performing micro-matching tasks
and expressing through clicks ci j whether or not a given candidate matching corresponds to
the source instance from their perspective.
Workers, matchings, and clicks are mapped onto binary variables in our model. Workers
take two values {Good, Bad} indicating whether they are reliable or not. Matchings can
either be Correct or Incorrect. As for the click variables, they represent whether worker i
considers that the source instance is the same as the proposed matching m j (Correct) or not
(Incorrect). We store prior distributions—which represent a priori knowledge obtained, for
example, through training phases or thanks to external sources—for each worker (pwi()) and
each matching (pm j()). The clicks are observed variables and are set to Correct or Incorrect
depending on how the human workers clicked on the crowdsourcing platform.
A simple example of such an entity graph is given in Figure 4.5. Clicks, workers, and matchings
are further connected through two factors described below.
The same network can be instantiated for each entity of an Entity Linking task where m j are
candidate links from the LOD instead.
Figure 4.4 – The Entity Linking HIT interface.
Matching & Linking Factors
Specific task (either matching or linking) factors mf j() connect each candidate to its related
clicks and the workers who performed those clicks. Examining the relationships between
those three classes of variables, we make two key observations: i) clicks from reliable workers
should weigh more than clicks from unreliable workers (actually, clicks from consistently
unreliable workers deciding randomly if a given answer is relevant or not should have no
weight at all in our decision process) and ii) when reliable workers do not agree, the likelihood
of the answer being correct should be proportional to the fraction of good workers indicating
the answer as correct. Taking into account both observations, and mapping the value 0 to
Incorrect and 1 to Correct, we write the following function for the factor:
mf(w1, ..., wm, c1, ..., cn, m) =
    0.5                                              if ∀wi ∈ {w1, ..., wm}: wi = Bad
    Σi 1(wi = Good ∧ ci = m) / Σi 1(wi = Good)       otherwise
(4.1)
where 1(cond) is an indicator function equal to 1 when cond is true and 0 otherwise.
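Equation 4.1 translates directly into code; the sketch below evaluates the factor for one assignment of the worker and click variables.

```python
def mf(workers, clicks, m):
    """Matching/linking factor of Equation 4.1.

    workers: list of 'Good'/'Bad' reliability values;
    clicks: list of 'Correct'/'Incorrect' click values, one per worker;
    m: value of the candidate matching ('Correct' or 'Incorrect')."""
    good = [i for i, w in enumerate(workers) if w == "Good"]
    if not good:
        return 0.5  # only unreliable workers: no usable information
    # Fraction of good workers whose click agrees with the value of m.
    return sum(1 for i in good if clicks[i] == m) / len(good)
```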
Unicity Constraints for Entity Linking
The instance matching task definition assumes that only one instance from the target dataset
can be a correct match for the source instance. Similarly, a concept appearing in textual
content can only be mapped to a single entity from a given dataset. We can thus rule out
all configurations where more than one candidate from the same LOD dataset is considered
as Correct. The corresponding factor u() is declared as being equal to 1 and is
Figure 4.5 – An entity factor-graph connecting two workers (wi ), six clicks (ci j ), and three candidate matchings (m j ).
defined as follows:
u(m1, ..., mn) =
    0    if ∃(mi, mj) ∈ {m1, ..., mn} | mi = mj = Correct
    1    otherwise
(4.2)
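Equation 4.2 similarly reduces to a check that at most one candidate is Correct:

```python
def u(matchings):
    """Unicity factor of Equation 4.2: returns 0 for any configuration in
    which more than one candidate matching is set to 'Correct'."""
    return 0 if sum(1 for m in matchings if m == "Correct") > 1 else 1
```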
SameAs Constraints for Entity Linking
SameAs constraints are exclusively used in Entity Linking graphs. They exploit the fact that
the resources identified by the links to the LOD cloud can themselves be interlinked (e.g.,
dbpedia:Fribourg is connected through an owl:sameAs link to fbase:Fribourg in the LOD
cloud)6. Considering that the SameAs links are correct, we define a constraint on the variables
connected by SameAs links found in the LOD cloud; the factor sa() connecting those variables
puts a constraint forbidding assignments where the variables would not be set to the same
values as follows:
sa(l1, ..., ln) =
    1    if ∀(li, lj) ∈ {l1, ..., ln}: li = lj
    0    otherwise
We enforce the constraint by declaring sa() = 1. This constraint considerably helps the decision
process when strong evidence (good priors, reliable clicks) is available for any of the URIs
connected to a SameAs link. When not all SameAs links should be considered as correct, further
probabilistic analyses (e.g., on the transitive closures of the links as defined in idMesh [45])
can be put into place.
6 We can already see the benefit of having better matchings across datasets for that matter.
4.5.2 Reaching a Decision
Given the scheme above, we can reach a sensible decision by simply running a probabilistic
inference method (e.g., the sum-product algorithm described above) on the network, and
considering as correct all matchings with a posterior probability P(l = Correct) > 0.5. The
Decision Engine can also consider a higher threshold τ > 0.5 for the decisions in order to
increase the precision of the results.
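For a single candidate matching, the posterior can be illustrated by brute-force enumeration instead of sum-product, marginalizing over the worker reliability variables. The factor mirrors Equation 4.1, but this tiny enumerator is our own sketch, not ZenCrowd's inference engine.

```python
from itertools import product

def posterior_correct(worker_priors, clicks, prior_m=0.5):
    """Brute-force posterior P(m = Correct) for one candidate matching,
    marginalizing over worker reliabilities with the factor of Eq. 4.1.

    worker_priors: list of P(w_i = Good); clicks: 'Correct'/'Incorrect'."""
    def mf(workers, m):
        good = [i for i, w in enumerate(workers) if w == "Good"]
        if not good:
            return 0.5
        return sum(1 for i in good if clicks[i] == m) / len(good)

    weight = {}
    for m, p_m in (("Correct", prior_m), ("Incorrect", 1 - prior_m)):
        total = 0.0
        # Enumerate every Good/Bad assignment of the worker variables.
        for ws in product(["Good", "Bad"], repeat=len(worker_priors)):
            p_w = 1.0
            for w, p in zip(ws, worker_priors):
                p_w *= p if w == "Good" else 1 - p
            total += p_m * p_w * mf(ws, m)
        weight[m] = total
    return weight["Correct"] / (weight["Correct"] + weight["Incorrect"])
```

Two workers with prior reliability 0.8 who both click Correct yield a posterior of 0.98, well above the 0.5 decision threshold.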
4.5.3 Updating the Priors
Our computations always take into account prior factors capturing a priori information about
the workers. As time passes, decisions are reached on the correctness of the various matches,
and the Probabilistic Network iteratively accumulates posterior probabilities on the reliability
of the workers. Actually, the network gets new posterior probabilities on the reliability of the
workers for every new matching decision that is reached. Thus, the Decision Engine can decide
to modify the priors of the workers by taking into account the evidences accumulated thus
far to enhance future results. In a probabilistic graphical model with missing observations,
this corresponds to a parameter learning phase. To tackle this type of problem, we use a
simple Expectation-Maximization [44, 52] process, as follows:
- Initialize the prior probability of the workers using a training phase during which work-
ers are evaluated on k matches whose results are known. Initialize their prior reliability
to #correct_results/k. If no information is available or no training phase is possible,
start with P(w = reliable) = P(w = unreliable) = 0.5 (maximum entropy principle).
- Gather posterior evidence on the reliability of the workers
P(w = reliable | mi = Correct/Incorrect) as soon as a decision is reached on a matching.
Treat this evidence as new observations on the reliability of the workers, and update
their prior beliefs iteratively as follows:

P(w = reliable) = (1/k) Σ_{i=1..k} Pi(w = reliable | mi)    (4.3)
where i runs over all the evidence gathered so far (from the training phase and from the
posterior evidence described above). Hence, we make the prior values slowly converge to their
maximum likelihood to reflect the fact that more and more evidence is being gathered about
the mappings as we reach more decisions on the instances. This technique can also be used to
identify and blacklist unreliable workers dynamically.
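The prior-update loop can be sketched as a small bookkeeping class; the class name and the blacklisting threshold are illustrative assumptions.

```python
class WorkerModel:
    """Tracks the evidence gathered about one worker and derives the
    prior reliability of Equation 4.3 as a running average."""

    def __init__(self, training_correct=None, k=None):
        if training_correct is not None and k:
            # Training phase: evaluated on k matches with known results.
            self.evidence = [training_correct / k]
        else:
            # Maximum entropy principle: start undecided.
            self.evidence = [0.5]

    def add_posterior(self, p_reliable):
        # New posterior P(w = reliable | m_i) after a matching decision.
        self.evidence.append(p_reliable)

    @property
    def prior(self):
        # Equation 4.3: average over all evidence gathered so far.
        return sum(self.evidence) / len(self.evidence)

    def blacklisted(self, threshold=0.2):
        # Illustrative rule for dynamically blacklisting workers.
        return self.prior < threshold
```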
4.5.4 Selective Model Instantiation
The framework described above actually creates a gigantic probabilistic graph, where all in-
stances, clicks, and workers are indirectly connected through various factors. However, only a
small subset of the variables need to be considered by the inference engine at any point in
time. Our system updates the various priors iteratively, but only instantiates the handful of
variables useful for reaching a decision on the entity currently examined. It thus dynamically
instantiates instance matching and entity linking factor-graphs, computes posterior probabili-
ties for the matchings and linking, reaches a decision, updates the priors, and stores back all
results before de-instantiating the graph and moving to the next instance/entity.
4.6 Experiments on Instance Matching
In this section, we experimentally evaluate the effectiveness of ZenCrowd for the Instance
Matching (IM) task. ZenCrowd is a relatively sophisticated system involving many components.
In the following, we present and discuss the results of a series of focused experiments, each
designed to illustrate the performance of a particular feature of our IM pipeline. We present
extensive experimental results evaluating the Entity Linking pipeline (depicted using an
orange background in Figure 4.1) in Section 4.7. Though many other experiments could have
been performed, we believe that the set of experiments presented below gives a particularly
accurate account of the performance of ZenCrowd for the IM task. We start by describing our
experimental setting below.
4.6.1 Experimental Setting
To evaluate the ZenCrowd IM pipeline based on Probabilistic Networks as well as on crowdsourcing, we use the following datasets: the ground-truth matching data comes from the Data Interlinking task of the Instance Matching track of the Ontology Alignment Evaluation Initiative (OAEI) 2011 [7]. In this competition, the task was to match a given New York Times (NYT) URI [8] to the corresponding URI in DBpedia, Freebase, and Geonames. The evaluation of automatic systems is based on manual matchings created by the NYT editorial team. Starting from such data, we obtained the corresponding Freebase-to-DBpedia links via transitivity through NYT instances. Thus, the ground truth is available for the task of matching a Freebase instance to the corresponding one in DBpedia, which is more challenging than the original task, as both Freebase and DBpedia are very large datasets generated semi-automatically, whereas the NYT data is small and manually curated.
In addition, we use a standard graph dataset containing data about all instances in our test-set (the Billion Triple Challenge BTC 2009 dataset [9]) in order to run our graph-based schema matching approach and to retrieve the data that is presented to the crowd. The BTC 2009 dataset consists of a crawl of RDF data from the Web containing more than one billion facts about 800 million instances.
[7] http://oaei.ontologymatching.org/2011/instance/
[8] http://data.nytimes.com/
[9] http://km.aifb.kit.edu/projects/btc-2009/
First Blocking Phase: LOD Indexing and Instance Ranking. In order to select candidate
matchings for the source instance, we adopt IR techniques similar to those that have been
used by participants of the Entity Search evaluation at the Semantic Search workshop for
the AOR task, where a string representing an entity (i.e., the query) is used to rank URIs that
identify the entity. We build an inverted index over 40 million instance labels in the considered
LOD datasets, and run queries against it using the source instance labels in our test collection.
Unless specified otherwise, the top-5 results ranked by TF-IDF are used as candidates for the
crowdsourcing task after their confidence score has been computed.
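The first blocking phase can be illustrated with a toy inverted index over instance labels, ranked by TF-IDF (a minimal sketch: the sample labels, URIs, and the exact scoring details are ours; the real system indexes 40 million labels):

```python
import math
from collections import Counter, defaultdict

# Toy label collection standing in for the 40M LOD instance labels.
labels = {
    "dbp:Barack_Obama": "barack obama",
    "dbp:Obama_Japan": "obama city japan",
    "dbp:Michelle_Obama": "michelle obama",
}

# Build the inverted index: term -> {URI: term frequency}.
index = defaultdict(dict)
for uri, text in labels.items():
    for term, tf in Counter(text.split()).items():
        index[term][uri] = tf

def top_candidates(query, n=5):
    """Rank candidate URIs for a source-instance label by TF-IDF."""
    n_docs = len(labels)
    scores = Counter()
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))  # rarer terms weigh more
        for uri, tf in postings.items():
            scores[uri] += tf * idf
    return [uri for uri, _ in scores.most_common(n)]
```

Querying `top_candidates("barack obama")` ranks `dbp:Barack_Obama` first, since "obama" appears in every label and thus carries no discriminative weight.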
Micro-Task Generation and ZenCrowd Aggregation. To evaluate the quality of each step
in the ZenCrowd IM pipeline, we selected a subset of 300 matching pairs from the ground
truth of different categories (100 persons, 100 locations, and 100 organizations). Then we
crowdsourced the entire collection to compare the quality of crowd matching against other
automatic matching techniques and their combinations.
The crowdsourcing tasks were run on Amazon Mechanical Turk [10] as two independent experiments for the two proposed matching interfaces (see Section 4.4.2). Each matching task was assigned to five different workers and remunerated at $0.05, employing a total of 91 workers [11].
We aggregate the results from the crowd using the method described in Section 4.5, with
an initial training phase consisting of 5 entities, and a second, continuous training phase,
consisting of 5% of the other entities being offered to the workers (i.e., the workers are given a
task whose solution is known by the system every 20 tasks on average).
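The continuous training phase described above (a task whose solution is known, served every 20 tasks on average, i.e., with 5% probability) could be implemented along these lines (a hypothetical sketch; the function and variable names are ours):

```python
import random

def next_task(gold_tasks, regular_tasks, rate=0.05, rng=random):
    """Return the next task for a worker: with probability `rate`,
    serve a gold task whose answer is known to the system, so the
    worker's reliability prior can be updated; otherwise serve a
    regular task. Returns a (task, is_gold) pair."""
    if gold_tasks and rng.random() < rate:
        return rng.choice(gold_tasks), True
    return regular_tasks.pop(), False

rng = random.Random(42)
regular = list(range(1000))
gold = ["gold-a", "gold-b"]
n_gold = sum(next_task(gold, regular, rng=rng)[1] for _ in range(1000))
# n_gold is around 50, i.e., roughly one gold task every 20 tasks.
```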
Evaluation Measures. In order to evaluate the effectiveness of the different components, we
compare—for each instance—the selected matches against the ground truth that provides
matching/non-matching data for each source instance. Specifically, we compute (P)recision
and (R)ecall, which are defined as follows: we consider as true positives (tp) all cases where both the ground truth and the approach select the same matches, as false positives (fp) the cases where the approach selects a match that is not considered correct by the ground truth, and as false negatives (fn) the cases where the approach does not select a match while the ground truth does. Then, Precision is defined as P = tp/(tp + fp) and Recall as R = tp/(tp + fn).
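Concretely, given the matches selected by an approach and the ground-truth matches (both as sets of instance pairs), Precision and Recall can be computed as follows (a minimal sketch; the names and example pairs are ours):

```python
def precision_recall(selected, ground_truth):
    """Compute Precision and Recall from the set of selected matches
    and the set of ground-truth matches."""
    tp = len(selected & ground_truth)   # correct matches selected
    fp = len(selected - ground_truth)   # selected but not correct
    fn = len(ground_truth - selected)   # correct but not selected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example with three ground-truth matches, two of which are found.
gt = {("fb:m.1", "dbp:A"), ("fb:m.2", "dbp:B"), ("fb:m.3", "dbp:C")}
sel = {("fb:m.1", "dbp:A"), ("fb:m.2", "dbp:B"), ("fb:m.4", "dbp:D")}
p, r = precision_recall(sel, gt)  # p = 2/3, r = 2/3
```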
In the following, all final matching approaches (automatic, crowd majority vote, and ZenCrowd) are optimized to return high Precision values. We decided to focus on Precision from the start since, in our experience, it is the most useful metric in practice; we have nevertheless observed that high Recall is obtained in most configurations.
[10] http://www.mturk.com
[11] The test-set we created, together with the matching results from the crowd, is available for download at http://exascale.info/ZenCrowd
4.6.2 Experimental Results
In the following, we report experimental results comparing the effectiveness of different matching techniques at different stages of the blocking process. In detail, we compare the results of our inverted-index-based matching, which is highly scalable but not particularly effective; the matching based on schema information; and the matching provided by the crowd, whose results are excellent but which is neither cost- nor time-efficient, because of the monetary cost it necessitates and the latency it generates.
Recall of the First Blocking Phase. The first evaluation we perform is centered on the initial
blocking phase based on keyword queries over the inverted index. It is critical that such a
step, while being efficiently performed over a large amount of potential candidate matchings,
preserves as many correct results as possible in the generated ranked list (i.e., high Recall) in
order for the subsequent matching phases to be effective. This, in turn, allows the graph- and crowd-based matching schemes to focus on high Precision.
Figure 4.6 shows how Recall varies when considering the top-N results as ranked by the inverted index using TF-IDF values. As we can see, with five candidate matches we already retrieve the correct matches for all the instances in our test-set.
Figure 4.6 – Maximum achievable Recall when considering the top-N results from the inverted index.
Second Blocking Phase: Matching Confidence Function. The second blocking step in-
volves the use of a matching confidence measure. This function measures the likelihood
of a match given a pair of instances based on schema matching results and string comparison
on the values directly attached to the instances in the graph (see Section 4.3.5). The goal of
such a function is to identify the matching pairs that are worth crowdsourcing in order to improve the effectiveness of the system.
Figure 4.7 shows how Precision and Recall vary when considering the matching pairs that match best according to our schema-based confidence measure. Specifically, by setting a threshold on the confidence score, we can let the system focus either on high Precision or on high Recall. For instance, if we only trust matches with a confidence value of 1.0, Precision is at its maximum (100%), but Recall is low (25%); that is, we would need to initiate many crowdsourcing tasks to compensate.
Figure 4.7 – Precision and Recall as a function of the matching confidence value.
Final Phase: Crowdsourcing and Probabilistic Reasoning. After the confidence score has
been computed and the matching pairs have been selected, our system makes it possible to
crowdsource some of the results and aggregate them into a final matching decision. A standard
approach to aggregate the results from the crowd is majority voting: the 5 automatically
selected candidate matchings are all proposed to 5 different workers who have to decide which
matching is correct for the given instance. After the task is completed, the matching with the most votes is selected as the valid matching. ZenCrowd, instead, aggregates the crowd results by means of the Probabilistic Network described in Section 4.5.
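The majority-voting baseline can be sketched as follows (a minimal sketch; ties are broken arbitrarily here, which the real setup may handle differently, and the example URIs are ours):

```python
from collections import Counter

def majority_vote(answers):
    """Select the candidate matching with the most votes among the
    answers given by the workers (one answer per worker)."""
    return Counter(answers).most_common(1)[0][0]

# Five workers judge which DBpedia URI matches a given instance.
votes = ["dbp:Zurich", "dbp:Zurich", "dbp:Zurich_Airport",
         "dbp:Zurich", "dbp:Canton_of_Zurich"]
best = majority_vote(votes)  # "dbp:Zurich", with 3 of 5 votes
```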
Table 4.2 shows the precision values of the crowd on all the matching pairs in our test-set.
Table 4.3 shows the precision values of the automatic approaches and their combinations with
the crowd results based both on Majority Voting as well as using ZenCrowd.
HIT Design    Aggregation    Organizations    People    Locations
Label-only    Maj. Vote      0.67             0.70      0.65
              ZenCrowd       0.77             0.75      0.73
Molecule      Maj. Vote      0.74             0.85      0.73
              ZenCrowd       0.81             0.87      0.81

Table 4.2 – Crowd Matching Precision over two different HIT design interfaces (Label-only and Molecule) and two different aggregation methods (Majority Vote and ZenCrowd).
                          Organizations    People    Locations
Inverted Index Baseline   0.78             0.98      0.89
Majority Vote             0.87             0.98      0.96
ZenCrowd                  0.89             0.98      0.97

Table 4.3 – Matching Precision for purely automatic and hybrid human/machine approaches.
From Table 4.2, we observe that (i) the crowd performance improves with the Molecule interface, that is, displaying data about the matching candidates directly from the graph database consistently leads to higher Precision across entity types than the interface that only displays the instance name and lets workers click on a link to obtain additional information; and (ii) the Probabilistic Network used by ZenCrowd to aggregate the outcome of crowdsourcing outperforms the standard Majority Vote aggregation scheme in all cases.
Figure 4.8 – Number of tasks generated for a given confidence value.
From Table 4.3, we can see that ZenCrowd outperforms (i) the purely automatic matching baseline based on the inverted-index ranking function, as well as (ii) the hybrid matching approach based on automatic ranking, schema-based matching confidence, and crowdsourcing with Majority Vote. Additionally, we observe that Organizations are the most challenging type of instances to match in our experiment, while People can be matched with high Precision using automatic methods only. On average over the different entity types, we could match data with 95% accuracy [12] (as compared to the initial 88% average accuracy of the purely automatic baseline).
Crowdsourcing Cost Optimization. In addition to being interested in the effectiveness of
the different matching methods, we are also interested in their cost in order to be able to
select the best trade-off among the available combinations. In the following, we report on
results focusing on an efficient selection of the matching pairs that the system crowdsources.
After the initial blocking step based on the inverted index (that is able to filter out most of the
non-relevant instances) we compute a confidence matching score for all top ranked instances
using the schema-based method. This second blocking step allows ZenCrowd to select, based
on a threshold on the computed confidence score, which matching pairs to crowdsource. Setting such a threshold allows the system to crowdsource only the cases with low confidence.
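This routing decision can be sketched as follows (a minimal sketch; the function name and example pairs are ours):

```python
def route(pairs, threshold):
    """Split candidate matching pairs into those accepted
    automatically and those sent to the crowd, based on the
    schema-based confidence score of each pair."""
    automatic, crowdsourced = [], []
    for pair, confidence in pairs:
        (automatic if confidence >= threshold else crowdsourced).append(pair)
    return automatic, crowdsourced

pairs = [(("fb:m.1", "dbp:A"), 0.95),
         (("fb:m.2", "dbp:B"), 0.40),
         (("fb:m.3", "dbp:C"), 0.10)]
auto, crowd = route(pairs, threshold=0.5)
# One pair is trusted automatically; the two low-confidence pairs
# generate crowdsourcing tasks.
```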
Figure 4.8 shows how many HITs are generated by ZenCrowd when varying the threshold on the confidence score. As we can see, when we set the confidence threshold to 0, we completely trust the automatic approach and crowdsource no matching. By increasing the threshold on the matching confidence, we are required to crowdsource matchings for more than half of our test-set instances. Comparing with Figure 4.7, we can see that the increase in the gap between Precision and Recall corresponds to the number of crowdsourced tasks: if Recall is low, we need to crowdsource new matching tasks to obtain results for those instances the automatic approach could not match with high confidence.
Crowd Performance Analysis. We are also interested in understanding how the crowd per-
forms on the instance matching task. Figure 4.9 shows the trade-off between the crowdsourc-
ing cost and the matching precision. We observe that our system is able to improve the overall
matching Precision by rewarding more workers (i.e., we select top-K workers based on their
[12] This is the average accuracy over all entity types reported in Table 4.3.
Figure 4.9 – ZenCrowd cost savings obtained by considering results from the top-K workers only.
prior probability, which is computed according to their past performance). On the other hand, it is possible to reduce the cost (as compared to the original 5-worker setup) with a limited loss in Precision by considering fewer workers.
Table 4.4 compares the crowd performance over the two different HIT designs. When comparing the two designs, we can observe that more errors are made with the Label-only interface (i.e., 66 vs. 38), as the workers do not have much information directly on the HIT page. Interestingly, we can also see that the errors common to both designs are minimal (i.e., 20 out of 300), which motivates further analysis and possible combinations of the two designs.
                        Label-only
                        Correct    Wrong
Molecule    Correct     176        66
            Wrong       38         20

Table 4.4 – Correct and incorrect matchings as determined by crowd Majority Voting using two different HIT designs.
Figure 4.10 presents worker accuracy as compared to the number of tasks performed by each worker. As we can see, most of the workers reach Precision values higher than 50%, and the workers who contributed the most provide high-quality results. When compared with the worker Precision over the Entity Linking task (see Figure 4.16, top), we can see that while the power-law distribution of completed HITs remains (see Figure 4.17), the crowd Precision on the Instance Matching task is clearly higher than on the Entity Linking task.
Finally, we briefly comment on the efficiency of our IM approach. In its current implemen-
tation, ZenCrowd takes on average 500ms to select and rank candidate matchings out of
the inverted index, 125ms to obtain instance information from the graph DB, and 500ms to
generate a micro-matching task on the crowdsourcing platform. The decision process takes
on average 100ms. Without taking into account any parallelization, our system can thus offer
a new matching task to the crowd roughly every second, which in our opinion is sufficient
for most applications. Once on the crowdsourcing platform, the tasks have a much higher
latency (several minutes to a few hours), a latency that is however mitigated by the fact that
instance matching is an embarrassingly parallel operation on crowdsourcing platforms (i.e.,
large collections of workers can work in parallel at any given point in time).
Figure 4.10 – Distribution of the workers' precision using the Molecule design as compared to the number of tasks performed by the workers.
4.6.3 Discussion
Looking back at the experimental results presented so far, we first observe that crowdsourcing instance matching is useful to improve the effectiveness of an instance matching system. State-of-the-art majority-voting crowdsourcing techniques can improve Precision by up to 12% relative over a purely automatic baseline (going from 0.78 to 0.87). ZenCrowd takes advantage of a probabilistic framework for making decisions and performs even better, leading to a relative performance improvement of up to 14% over our best automatic matching approach (going from 0.78 to 0.89) [13].
A more general observation is that instance matching is a challenging task, which can rapidly become impractical when errors are made in the initial blocking phases. Analyzing the population of workers on the crowdsourcing platform (see Figure 4.17), we observe that the number of tasks performed by a given worker exhibits a long-tail distribution (i.e., few workers perform many tasks, while many workers perform only a few tasks). Also, we observe that the average precision of the workers is broadly distributed over [0.5, 1] (see Figure 4.10). As workers cannot be selected dynamically for a given task on current crowdsourcing platforms (all we can do is prevent some workers from receiving further tasks through blacklisting, or decide not to reward workers who consistently perform badly), obtaining perfect matching results is thus in general unrealistic in non-controlled settings.
4.7 Experiments on Entity Linking
4.7.1 Experimental Setting
Dataset Description. In order to evaluate ZenCrowd on the Entity Linking (EL) task, we created an ad-hoc test collection [14]. The collection consists of 25 news articles written in English from CNN.com, NYTimes.com, washingtonpost.com, timesofindia.indiatimes.com,
[13] The improvement is statistically significant (t-test p < 0.05).
[14] The test collection we created is available for download at http://exascale.info/zencrowd/.
and swissinfo.com, which were manually selected to cover global interest news (10), US local
news (5), India local news (5), and Switzerland local news (5). After the full text of the articles
had been extracted from the HTML pages [93], 489 entities were extracted from it using the Stanford Parser [92] as entity extractor. The collection of candidate URIs is composed of all entities from DBpedia [15], Freebase [16], Geonames [17], and NYT [18], summing up to approximately 40 million entities (23M from Freebase, 9M from DBpedia, 8M from Geonames, 22K from NYT). Expert editors manually selected the correct URIs for all the entities in the collection to create the ground truth for our experiments. Crowdsourcing was performed on the Amazon MTurk [19] platform, where 80 distinct workers were employed. A single task, paid $0.01, consisted of selecting the correct URIs out of five proposed URIs for a given entity.
In the following, we present and discuss the results of a series of focused experiments, each
designed to illustrate the performance of a particular feature of our EL pipeline or of related
techniques. We start by describing a relatively simple base-configuration for our experimental
setting below.
LOD Indexing, Entity Linking and Ranking. In order to select candidate URIs for an entity,
we adopt the same IR techniques used for the IM task. We build an inverted index over 40
million entity labels in the considered LOD datasets, and run queries against it using the
entities extracted from the news articles in the test collection. Unless specified otherwise, the
top 5 results ranked by TF-IDF are used as candidates for the crowdsourcing task.
Micro-Task Generation. We dynamically create a task on MTurk for each entity sent to
the crowd. We generate a micro-task where the entity (possibly with some textual context)
is shown to the worker who has then to select all the URIs that match the entity, with the
possibility to click on the URI and visit the corresponding webpage. If no URI matches the
entity, the worker can select the “None of the above” answer. An additional field is available
for the worker to leave comments.
Evaluation Measures. In order to evaluate the effectiveness of our EL methods we compare,
for each entity, the selected URIs against the ground truth which provides matching/non-
matching information for each candidate URI. Similarly to what we did for the IM task evaluation, we compute (P)recision, (R)ecall, and (A)ccuracy, which are defined as follows: we consider as true positives (tp) all cases where both the ground truth and the approach select the URI, as true negatives (tn) the cases where both the ground truth and the approach do not select the URI for the entity, as false positives (fp) the cases where the approach selects a URI that is not considered correct by the ground truth, and as false negatives (fn) the cases where the approach does not select a URI that is correct in the ground truth. Then, Precision is defined as P = tp/(tp + fp), Recall as R = tp/(tp + fn), and Accuracy as A = (tp + tn)/(tp + tn + fp + fn).

[15] http://dbpedia.org/
[16] http://www.freebase.com/
[17] http://www.geonames.org/
[18] http://data.nytimes.com/
[19] http://www.mturk.com

            All Entities       Linkable Entities
            P       R          P       R
GL News     0.27    0.67       0.40    1.0
US News     0.17    0.46       0.36    1.0
IN News     0.22    0.62       0.36    1.0
SW News     0.21    0.63       0.34    1.0
All News    0.24    0.63       0.37    1.0

Table 4.5 – Performance results for the candidate selection approach.
In the following, all the final EL approaches (automatic, majority vote, and ZenCrowd) are op-
timized to return high precision values. We decided to focus on precision from the start, since
from our experience it is the most useful metric in practice (i.e., entity linking applications
typically tend to favor precision to foster correct information processing capabilities, at the
expense of not linking some of the entities).
4.7.2 Experimental Results
Entity Extraction and Linkable Entities. We start by evaluating the performance of the
entity extraction process. As described above, we use a state of the art extractor (the Stanford
Parser) for this task. According to our ground truth, 383 out of the 488 automatically extracted
entities can be correctly linked to URIs in our experiments, while the remaining ones are either
wrongly extracted, or are not available in the LOD cloud we consider. Unless stated otherwise,
we average our results over all linkable entities, i.e., all entities for which at least one correct
link can be picked out (we disregard the other entities for several experiments, since they were
wrongly extracted from the text or are not at all available in the LOD data we consider and
thus can be seen as a constant noise level in our experiments).
Candidate Selection. We now turn to the evaluation of our candidate selection method. As
described above, candidate selection consists in the present case in ranking URIs using TF-IDF
given an extracted entity20. We focus on high Recall for this phase (i.e., we aim at keeping
as many potentially interesting candidates as possible), and decided to keep the top-5 URIs
produced by this process. Thus, we aim at preserving as many correct URIs as possible for
later linking steps (e.g., in order to provide good candidate URIs to the crowd). We report on
the performance of candidate selection in Table 4.5.
[20] Our approach is hence similar to [29], though we do not use BM25F as a ranking function.
Figure 4.11 – Average Recall of candidate selection when discriminating on the maximum relevance probability in the candidate URI set.
As we can observe, results are consistent with our goal since all interesting candidates are
preserved by this method (Recall of 1 for the linkable entities set).
Then, we examine the potential role of the highest confidence scores in the candidate selection
process. This analysis helps us decide when crowdsourcing an EL task is useful and when it is
not. In Figure 4.11, we report on the average Recall of the top-5 candidates when classifying results based on the maximum confidence score obtained (top-1 score). The results are averaged over all extracted entities [21].
As expected, we observe that high confidence values in the candidate selection lead to high Recall and, therefore, to candidate sets that contain many of the correct URIs. For this reason, it is useful to crowdsource EL tasks only for those cases exhibiting relatively high confidence values (e.g., > 0.5). When the highest confidence value in the candidate set is low, it is more likely that no URI will match the entity (because the entity has no URI in the LOD cloud we consider, or because the entity extractor extracted it wrongly).
On the other hand, crowdsourcing might be unnecessary for cases where the Precision of the automatic candidate selection phase is already quite high. The automatic selection techniques can be adapted to identify the correct URIs in a completely automatic fashion. In the following, we automatically select top-1 candidates only (i.e., the link with the highest confidence), in order to focus on high-Precision results as required by many practical applications. A different approach focusing on Recall might select all candidates with a confidence higher than a certain threshold. Figure 4.12 reports on the performance of our fully automatic entity linking approaches. We observe that when the top-1 URI is selected, the automatic approach reaches a Precision value of 0.70 at the cost of low Recall (i.e., fewer links are picked). As later results will show, crowdsourcing techniques can improve both Precision and Recall over these automatic entity linking approaches in all cases.
[21] Confidence scores have all been normalized to [0, 1] by manually defining a transformation function.
Figure 4.12 – Performance results (Precision, Recall) for the automatic approach.
Entity Linking using Crowdsourcing with Majority Vote. We now report on the performance of a state-of-the-art crowdsourcing approach based on majority voting: the 5 automatically selected candidate URIs are all proposed to 5 different workers, who have to decide which URI(s) is (are) correct for the given entity. After the task is completed, the URIs with at least 2 votes are selected as valid links (we tried various thresholds and manually picked 2 in the end, since it leads to the highest Precision scores while keeping good Recall values in our experiments). We report on the performance of this crowdsourcing technique in Table 4.6. The values are averaged over all linkable entities for different document types and worker communities.
            US Workers               Indian Workers
            P       R       A        P       R       A
GL News     0.79    0.85    0.77     0.60    0.80    0.60
US News     0.52    0.61    0.54     0.50    0.74    0.47
IN News     0.62    0.76    0.65     0.64    0.86    0.63
SW News     0.69    0.82    0.69     0.50    0.69    0.56
All News    0.74    0.82    0.73     0.57    0.78    0.59

Table 4.6 – Performance results for crowdsourcing with majority vote over linkable entities.
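The threshold-based vote aggregation used here (keeping every URI with at least 2 of the 5 votes) can be sketched as follows (a minimal sketch; the function name and example URIs are ours):

```python
from collections import Counter

def select_links(worker_selections, min_votes=2):
    """Aggregate the URI selections of several workers for one entity:
    a URI is kept as a valid link if at least `min_votes` workers
    selected it. Each worker may select several URIs, or none."""
    votes = Counter(uri for selection in worker_selections
                    for uri in selection)
    return {uri for uri, n in votes.items() if n >= min_votes}

# Five workers, each selecting the URIs they judge correct.
selections = [{"dbp:Fribourg"}, {"dbp:Fribourg", "geo:2660718"},
              {"geo:2660718"}, {"dbp:Fribourg"}, set()]
links = select_links(selections)  # both URIs reach the 2-vote threshold
```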
The first question we examine is whether there is a difference in reliability between the various
populations of workers. In Figure 4.13 we show the performance for tasks performed by
workers located in USA and India (each point corresponds to the average Precision and Recall
over all entities in one document). On average, we observe that tasks performed by workers
located in the USA lead to higher precision values. As we can see in Table 4.6, Indian workers
obtain higher Precision and Recall on local Indian news as compared to US workers. The
biggest difference in terms of accuracy between the two communities can be observed on the
global interest news.
A second question we examine is how the textual context given for an entity influences worker performance. In Figure 4.14, we compare the tasks for which only the entity label is
Figure 4.13 – Per-document task effectiveness.
Figure 4.14 – Crowdsourcing results with two different textual contexts.
            US Workers               Indian Workers
            P       R       A        P       R       A
GL News     0.84    0.87    0.90     0.67    0.64    0.78
US News     0.64    0.68    0.78     0.55    0.63    0.71
IN News     0.84    0.82    0.89     0.75    0.77    0.80
SW News     0.72    0.80    0.85     0.61    0.62    0.73
All News    0.80    0.81    0.88     0.64    0.62    0.76

Table 4.7 – Performance results for crowdsourcing with ZenCrowd over linkable entities.
given (simple) to those for which a context consisting of all the sentences containing the entity is shown to the worker (snippets). Surprisingly, we could not observe a significant difference in effectiveness caused by the different textual contexts given to the workers. Thus, we focus on only one type of context for the remaining experiments (we always give the snippet context).
Entity Linking with ZenCrowd. We now focus on the performance of the probabilistic infer-
ence network as proposed in this chapter. We consider the method described in Section 4.5,
with an initial training phase consisting of 5 entities, and a second, continuous training phase,
consisting of 5% of the other entities being offered to the workers (i.e., the workers are given a
task whose solution is known by the system every 20 tasks on average).
Figure 4.15 – Comparison of three linking techniques.
In order to reduce the number of tasks having little influence on the final results, a simple technique for blacklisting bad workers is used. A bad worker (who can be considered a spammer) is a worker who randomly and rapidly clicks on the links, hence generating noise in our system. In our experiments, we consider 3 consecutive bad answers in the training phase enough to identify the worker as a spammer and to blacklist him/her. We report the average results of ZenCrowd when exploiting the training phase, constraints, and blacklisting in Table 4.7. As we can observe, Precision and Accuracy values are higher in all cases when compared to the majority vote approach (see Table 4.6).
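The spammer-detection rule above (three consecutive wrong answers on training tasks) can be sketched as follows (a minimal sketch; the class, method, and worker names are ours):

```python
class TrainingMonitor:
    """Blacklist a worker after `limit` consecutive wrong answers
    on training tasks whose solutions are known to the system."""

    def __init__(self, limit=3):
        self.limit = limit
        self.consecutive_wrong = {}
        self.blacklisted = set()

    def record(self, worker, correct):
        """Record one training-task answer; return whether the
        worker is now blacklisted."""
        if correct:
            self.consecutive_wrong[worker] = 0  # streak broken
        else:
            self.consecutive_wrong[worker] = \
                self.consecutive_wrong.get(worker, 0) + 1
            if self.consecutive_wrong[worker] >= self.limit:
                self.blacklisted.add(worker)
        return worker in self.blacklisted

monitor = TrainingMonitor()
for answer in [True, False, False, False]:  # 3 wrong in a row
    flagged = monitor.record("w42", answer)
# flagged is True: worker "w42" is blacklisted as a likely spammer.
```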
Finally, we compare ZenCrowd to the state-of-the-art crowdsourcing approach (using the optimal majority vote) and to our best automatic approach on a per-task basis in Figure 4.15. The comparison is given for each document in the test collection. We observe that in most cases the human intelligence contribution improves the Precision of the automatic approach. We also observe that ZenCrowd dominates in overall performance (it is the best-performing approach in more than 3/4 of the cases).
Efficiency. Finally, we briefly comment on the efficiency of our approach. In its current im-
plementation, ZenCrowd takes on average 200ms to extract an entity from text, 500ms to select
and rank candidate URIs, and 500ms to generate a micro-linking task. The decision process
takes on average 100ms. Without taking into account any parallelization, our system can thus
offer a new entity to the crowd roughly every second, which in our opinion is sufficient for
most applications (e.g., enriching newspaper articles or internal company documents). Once
on the crowdsourcing platform, the tasks have a much higher latency (several minutes to a few
hours), a latency that is however mitigated by the fact that entity linking is an embarrassingly
parallel operation on crowdsourcing platforms (i.e., large collections of workers can work in
parallel at any given point in time).
Figure 4.16 – Distribution of the workers' Precision for the Entity Linking task as compared to the number of tasks performed by the worker (top), and task Precision with the top-k workers (bottom).
4.7.3 Discussion
Looking at the experimental results about the EL task presented above, we observe that the
crowdsourcing step improves the overall EL effectiveness of the system.
Standard crowdsourcing techniques (i.e., using majority vote aggregation) yield a relative improvement of 6% in Precision (from 0.70 to 0.74). ZenCrowd, by leveraging its probabilistic framework for making decisions, performs better, leading to a relative performance improvement ranging between 4% and 35% over the majority vote approach, and of 14% on average over our best automatic linking approach (from 0.70 to 0.80). In both cases, the improvement is statistically significant (t-test, p < 0.05).
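As a baseline for comparison, majority vote aggregation can be sketched in a few lines of Python. This is a minimal illustration of the aggregation principle, not ZenCrowd's actual implementation; the entity URIs below are hypothetical:

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate crowd answers for one task by simple majority.

    `answers` is a list of candidate labels (one per worker); for equal
    counts, Counter.most_common returns the label seen first.
    """
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

# Example: three workers link an entity mention, two agree.
winner = majority_vote(["dbpedia:Barack_Obama",
                        "dbpedia:Barack_Obama",
                        "dbpedia:Michelle_Obama"])
```

Such an unweighted vote treats all workers as equally reliable, which is precisely the limitation that the probabilistic framework addresses.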
Analyzing worker activities on the crowdsourcing platform (see Figure 4.17), we observe that
the number of tasks performed by a given worker is Zipf-distributed (i.e., few workers perform
many tasks, while many workers perform a few tasks only).
Augmenting the number of workers performing a given task is not always beneficial: Figure 4.16, bottom, shows the average Precision of ZenCrowd when (virtually) employing the top-k available workers for a given task. As can be seen from the graph, the quality of the results degrades after a certain value of k, as more and more mediocre workers are picked. As a general rule, we observe that limiting the number of workers to 4 or 5 good workers per task gives the best results.
The intuition behind using the Probabilistic Network is that a worker who has proven to be good, i.e., who has a high prior probability, should be trusted for future jobs. Furthermore, his/her answers should prevail and help identify other good workers. The Probabilistic Network also takes advantage of constraints to support the decision process.
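This intuition can be illustrated with a simple prior-weighted vote. The sketch below is a rough approximation of the idea, not the actual Probabilistic Network inference; the function and parameter names are ours:

```python
def weighted_vote(answers, priors, default_prior=0.5):
    """Weight each worker's answer by a prior reliability estimate.

    `answers` maps worker id -> chosen label; `priors` maps worker id ->
    probability that the worker answers correctly (e.g., learned from
    test questions). Workers without a prior fall back to `default_prior`,
    mirroring the idea that an unknown worker is scored via his/her peers.
    """
    scores = {}
    for worker, label in answers.items():
        weight = priors.get(worker, default_prior)
        scores[label] = scores.get(label, 0.0) + weight
    # Return the label with the highest accumulated weight.
    return max(scores, key=scores.get)
```

With a trusted worker (prior 0.95) on one side and two mediocre workers (priors 0.5 and 0.4) on the other, the trusted worker's answer prevails, as the text suggests.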
While the datasets used for the IM and EL evaluations are different, we can make some observations on the average effectiveness reached for each task. On average, the effectiveness of the workers on the IM task is higher than on the EL task. However, we observe that
Chapter 4. Human Intelligence Task Quality Assurance
Figure 4.17 – Number of HITs completed by each worker for both IM and EL, ordered by most productive workers first.
ZenCrowd is able to exploit the work performed by the most effective workers (e.g., top US
worker in Figure 4.16 top or the highly productive workers in Figure 4.10).
4.8 Related Work on Entity Linking and Instance Matching
4.8.1 Instance Matching
The first task that we address is that of matching instances of multiple types between two datasets. Thanks to the LOD movement, many datasets describing instances have been
created and published on the Web.
A lot of attention has been devoted to the task of automatic instance matching, which is defined as the identification of the same real-world object described in two different datasets. Classical
matching approaches are based on string similarities (“Barack Obama” vs. “B. Obama”) such
as the edit distance [106], the Jaro similarity [81], or the Jaro-Winkler similarity [169]. More
advanced techniques, such as instance Group Linkage [126], compare groups of records to find
matches. A third class of approaches uses semantic information. Reference Reconciliation
[57], for example, builds a dependency graph and exploits relations to propagate information
among entities. Recently, approaches exploiting Wikipedia as background corpus have been
proposed as well [36, 43]. In [72], the authors propose entity disambiguation techniques using relations between entities and concepts in Wikipedia. The technique uses, for example, the links between "Michael Jordan" and the "University of California, Berkeley" or "basketball" on Wikipedia.
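As an illustration of the classical string-similarity approaches mentioned above, a minimal implementation of the edit distance and a derived normalized similarity might look as follows. This is a textbook sketch, not the exact measures used in the cited work:

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning s into t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[n]

def string_similarity(s, t):
    """Normalize the distance to [0, 1]: 1.0 means identical strings."""
    if not s and not t:
        return 1.0
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))

# "Barack Obama" vs. "B. Obama" yields a moderately high similarity.
score = string_similarity("Barack Obama", "B. Obama")
```

Thresholding such a score is the simplest possible matcher; the more advanced group-based and semantic approaches discussed above refine exactly this building block.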
The number of candidate matching pairs between two datasets grows rapidly (i.e., quadrat-
ically) with the size of the data, making the matching task rapidly intractable in practice.
Methods based on blocking [167, 127] have been proposed to tackle scalability issues. The idea is to first group together candidate matching pairs using a computationally inexpensive method and, as a second step, to apply a more accurate but expensive measure to compare all possible pairs within the candidate set.
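A minimal blocking sketch, assuming a hypothetical blocking key function such as the lowercased last token of a record, could look like this:

```python
def block_by_key(records, key):
    """Cheap first pass: group records by an inexpensive blocking key
    (`key` is any callable mapping a record to a hashable value)."""
    blocks = {}
    for r in records:
        blocks.setdefault(key(r), []).append(r)
    return blocks

def candidate_pairs(records_a, records_b, key):
    """Only pairs sharing a blocking key are kept for the expensive
    comparison step, avoiding the quadratic blow-up over all pairs."""
    blocks_b = block_by_key(records_b, key)
    pairs = []
    for a in records_a:
        for b in blocks_b.get(key(a), []):
            pairs.append((a, b))
    return pairs

# Illustrative key: last token, lowercased.
last_token = lambda r: r.split()[-1].lower()
pairs = candidate_pairs(["Barack Obama", "Angela Merkel"],
                        ["B. Obama", "A. Merkel", "Dunkirk Beach"],
                        last_token)
```

Only two of the six possible cross-dataset pairs survive the blocking step here; the expensive similarity measure would then be applied to those two pairs only.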
Crowdsourcing techniques have already been leveraged for instance matching. In [164], the
authors propose a hybrid human-machine approach that exploits both the scalability of
automatic methods as well as the accuracy of manual matching. The focus of their work is
on how to best present the matching task to the crowd. Instead, our work focuses on how to
combine automated and manual matching by means of a three-stage blocking technique and
a Probabilistic Network able to identify and weight out low-quality answers.
In idMesh [45], the authors build disambiguation graphs based on the transitive closures of
equivalence links for networks containing uncertain information. Our present work focuses
on hybrid matching techniques for LOD datasets, combining both automated processes and
human computation in order to obtain a system that is both scalable and highly accurate.
4.8.2 Entity Linking
The other task performed by ZenCrowd is Entity Linking, that is, identifying instances from textual content and linking them to their description in a database. Entities, that is, real-world objects described by a given schema/ontology, are increasingly becoming first-class citizens on the Web. A large amount of online search queries are about entities [133], and search engines exploit entities and structured data to build their result pages [70]. In the field of Information Retrieval (IR), a lot of attention has been given to entities: at TREC22, the main IR evaluation initiative, the tasks of Expert Finding, Related Entity Finding, and Entity List Completion have been studied [17, 19].
The problem of assigning identifiers to instances mentioned in textual content (i.e., entity
linking) has been widely studied by the database and the Semantic Web research communities.
A related effort has for example been carried out in the context of the OKKAM project23, which
suggested the idea of an Entity Name System (ENS) to assign identifiers to entities on the Web
[30].
The first step in entity linking consists in extracting entities from textual content. Several
approaches developed within the NLP field provide high-quality entity extraction for persons,
locations, and organizations [21, 40]. State-of-the-art techniques are implemented in tools like Gate [46], the Stanford parser [92] (which we use in our experiments), and Extractiv24. Once
entities are extracted, they still need to be disambiguated and matched to semantically similar
but syntactically different occurrences of the same real-world object (e.g., “Mr. Obama” and
“President of the USA”).
The final step in entity linking is that of deciding which links to retain in order to enrich the
entity. Systems performing such a task are available as well (e.g., Open Calais25, DBPedia
Spotlight [118]). Relevant approaches aim, for instance, at enriching documents by automatically creating links to Wikipedia pages [120, 143], which can be seen as entity identifiers.
22 http://trec.nist.gov
23 http://www.okkam.org
24 http://extractiv.com/
25 http://www.opencalais.com/
While previous work selects Uniform Resource Identifiers (URIs) from a specific corpus (e.g.,
DBPedia, Wikipedia), our approach is to assign entity identifiers from the larger LOD cloud26
instead.
4.9 Conclusions
We have presented a data integration system, ZenCrowd, based on a probabilistic framework
leveraging both automatic techniques and punctual human intelligence feedback captured
on a crowdsourcing platform. ZenCrowd adopts a novel three-stage blocking process that can
deal with very large datasets while at the same time minimizing the cost of crowdsourcing by
carefully selecting the right candidate matches to crowdsource.
As our approach incorporates a human intelligence component, it typically cannot perform
instance matching and entity linking tasks in real-time. However, we believe that it can still be
used in most practical settings, thanks to the embarrassingly parallel nature of data integration
in crowdsourcing environments. ZenCrowd provides a reliable approach to entity linking and instance matching, which exploits the trade-off between large-scale automatic instance matching and high-quality human annotation, and which, according to our results, improves the precision of instance matching by up to 14% over our best automatic matching approach. For the Entity Linking task, ZenCrowd improves the precision of the results by 4% to 35% over a state-of-the-art and manually optimized crowdsourcing approach, and on average by 14% over our best automatic approach.
Finally, we can generalize the data integration use-case of ZenCrowd to any task with Multiple Choice Questions stemming from a hybrid human-machine algorithm. The probabilistic framework that we have built can deal with noisy worker answers by assigning weights (or priors) to them based on test questions. If a worker did not pass a test question, he/she will be assigned a score based on his/her peers (the ones answering the same task). Other priors and constraints, derived from the algorithm's pre-processing step, can be added to the inference framework if available.
The crowdsourcing model used in ZenCrowd (through AMT) provides no guarantees on which crowd workers will perform a given task. As such, our algorithm can only be executed as a post-processing step to perform result aggregation. In the next chapter, we will investigate a technique that allows us to select which crowd workers to ask for a given task.
26 http://linkeddata.org/
5 Human Intelligence Task Routing
5.1 Introduction
Human Intelligence Tasks are simple tasks that anyone (with the required cognitive abilities) can perform. In some cases, it can be beneficial for the worker to be acquainted with the task (or even trained for it) in order to provide more accurate, and possibly faster, answers. Take, for example, the case of marine life image labeling. It is natural that an expert, or even an amateur, in marine species can perform the task more easily and quickly than a crowd worker with access to an encyclopedia to double-check his answers. Similarly, in the previous chapter, we asked crowd workers to perform IM/EL tasks on news articles. We suspect that many workers would be more effective if they had some background knowledge about the articles (e.g., news articles from their respective countries or areas of interest).
Current approaches to crowdsourcing adopt a pull methodology: tasks are published on specialized Web platforms, and workers pick their preferred tasks on a first-come-first-served basis. While this approach has many advantages, such as simplicity and short completion times, it does not guarantee that a task is performed by the most suitable worker.
In this chapter, we propose and evaluate Pick-A-Crowd, a software architecture to crowdsource
micro-tasks based on pushing tasks to specific workers. Our system constructs user profiles
for each worker in the crowd in order to assign HITs to the most suitable available worker.
We build such worker profiles based on information available on social networks using, for instance, information about the workers' personal interests. The underlying assumption is that
if a potential worker is interested in several specific categories (e.g., movies), he/she will be
more competent at tackling HITs related to that category (e.g., movie genre classification). In
our system, workers and HITs are matched based on an underlying taxonomy that is defined
on categories extracted both from the tasks at hand and from the workers’ interests. Entities
appearing in the users’ social profiles are linked to the Linked Open Data (LOD) cloud1, where
they are then matched to related tasks that are available on the crowdsourcing platform. We
1 http://linkeddata.org/
experimentally evaluate our push methodology and compare it against traditional crowd-
sourcing approaches using tasks of varying types and complexity. Results show that the quality
of the answers is significantly higher when using a push methodology.
In summary, the contributions described in this chapter are:
• a Crowdsourcing framework that focuses on pushing HITs to the crowd.
• a software architecture that implements the newly proposed push crowdsourcing method-
ology.
• category-based, text-based, and graph-based approaches to assign HITs to workers
based on links in the LOD cloud.
• an empirical evaluation of our method in a real deployment over different crowds
showing that our Pick-A-Crowd system is on average 29% more effective than traditional
pull crowdsourcing platforms over a variety of HITs.
The rest of this chapter is structured as follows: Section 5.2 gives an overview of the architecture
of our system, including its HIT publishing interface, its crowd profiling engine, and its HIT
assignment and reward estimation components. We introduce our formal model to match
human workers to HITs using category-based, text-based, and graph-based approaches in
Section 5.3. We describe our evaluation methodology and discuss results we obtained from a
real deployment of our system in Section 5.4. We review related work on Recommender Systems and Expert Finding in Section 5.5, before concluding in Section 5.6.
5.2 System Architecture
In this section, we describe the Pick-A-Crowd framework and provide details on each of its
components.
5.2.1 System Overview
Figure 5.1 gives a simplified overview of our system. Pick-A-Crowd receives as input tasks
that need to be completed by the crowd. The tasks are composed of a textual description,
which can be used to automatically select the right crowd for the task, actual data on which to
run the task (e.g., a Web form and set of images with candidate labels), as well as a monetary
budget to be spent to get the task completed. The system then creates the HITs, and predicts
the difficulty of each micro-task based on the crowd profiles and on the task description.
The monetary budget is split among the generated micro-tasks according to their expected
difficulty (i.e., a more difficult task will be given a higher reward). The HITs are then assigned
to selected workers from the crowd and published on the social network application. Finally,
answers are processed as a stream from the crowd, aggregated and sent back to the requester.
We detail the functionalities provided by each component of the system in the following.
Figure 5.1 – Pick-A-Crowd Component Architecture. Task descriptions, Input Data, and a Monetary Budget are taken as input by the system, which creates HITs, estimates their difficulty and suggests a fair reward based on the skills of the crowd. HITs are then pushed to selected workers and results get collected, aggregated, and finally returned back to the requester.
5.2.2 HIT Generation, Difficulty Assessment, and Reward Estimation
The first pipeline in the system is responsible for generating the HITs given some input data
provided by the requester. HITs can for instance be generated from i) a Web template to classify
images in pre-defined categories, together with ii) a set of images and iii) a list of pre-defined
categories. The HIT Generator component dynamically creates as many tasks as required (e.g.,
one task per image to categorize) by combining those three pieces of information.
Next, the HIT Difficulty Assessor takes each HIT and determines a complexity score for it.
This score is computed based on both the specific HIT (i.e., description, keywords, candidate
answers, etc.) and on the worker profiles (see Section 5.3 for more detail on how such profiles
are constructed). Different algorithms can be implemented to assess the difficulty of the tasks
in our framework. For example, a text-based approach can compare the textual description of
the task with the skill description of each worker and compute a score based on how many
workers in the crowd could perform well on such HITs.
An alternative, more advanced prediction method can exploit entities involved in the task and
and known by the crowd. Entities are extracted from the textual descriptions of the tasks and
disambiguated to LOD entities. The same can be performed on the worker profiles: each
Facebook page that is liked by the workers can be linked to its respective LOD entities. Then
the set of entities representing the HITs and the set of entities representing the interests of the
crowd can be directly compared. The task is classified as difficult when the entities involved in
the task heavily differ from the entities liked by the crowd.
A third example of task difficulty prediction method is based on Machine Learning. A classifier
assessing the task difficulty is trained by means of previously completed tasks, their description
and their result accuracy. Then, the description of a new task is given as a test vector to the
classifier, which returns the predicted difficulty for the new task.
Finally, the Reward Estimation component takes as input a monetary budget B and the results of the HIT assessment to determine a reward value for each HIT h_i.
A simple way to redistribute the available monetary budget is to reward the same amount of money for each task of the same type. A second example of a reward estimation function is:
\[
\mathit{reward}(h_i) = \frac{B \cdot d(h_i)}{\sum_j d(h_j)} \tag{5.1}
\]
which takes into account the difficulty d() of the HIT h_i as compared to the others and assigns a higher reward to more difficult tasks.
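Equation 5.1 can be implemented directly. The sketch below assumes difficulties are given as a mapping from (hypothetical) HIT identifiers to positive scores:

```python
def split_budget(budget, difficulties):
    """Distribute a monetary budget B over HITs proportionally to
    their estimated difficulty d(h_i), as in Equation 5.1.

    `difficulties` maps a HIT identifier to its difficulty score d(h_i).
    """
    total = sum(difficulties.values())
    return {hit: budget * d / total for hit, d in difficulties.items()}

# A HIT estimated three times as difficult receives three times the reward.
rewards = split_budget(10.0, {"h1": 1.0, "h2": 3.0})
```

By construction, the individual rewards always sum back to the overall budget B.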
A third approach computes a reward based on both the specific HIT as well as the worker who performs it. In order to do this, we can exploit the HIT assignment models adopted by our system. These models generate a ranking of workers by computing a function match(w_j, h_i) for each worker w_j and HIT h_i (see Section 5.3). Given such a function, we can assign a higher reward to better-suited workers by

\[
\mathit{reward}(h_i, w_j) = \frac{B \cdot \mathit{match}(w_j, h_i)}{\sum_{k,l} \mathit{match}(w_k, h_l)} \tag{5.2}
\]
More advanced reward schemes can be applied as well. For example, in [83], the authors propose game-theoretic approaches to compute the optimal reward for paid crowdsourcing incentives in the presence of workers who collude in order to game the system.
Exploring and evaluating different difficulty prediction and reward estimation approaches is
not our focus and is left as future work.
5.2.3 Crowd Profiler
The task of the Crowd Profiler component is to collect information about each available
worker in the crowd. Pick-A-Crowd uses contents available on the social network platform as
well as previously completed HITs to construct the workers’ profiles. Those profiles contain
information about the skills and interests of the workers and are used to match HITs with
available workers in the crowd.
In detail, this module generates a set of worker profiles C = {w_1, .., w_n}, where w_i = {P, T}: P is the set of worker interests (e.g., when applied on top of the Facebook platform, each p_i ∈ P is a Facebook page the worker likes) and T = {t_1, .., t_n} is the set of tasks previously completed by w_i. Each Facebook page p_i belongs to a category in the Facebook Open Graph2.
5.2.4 Worker Profile Linker
This component is responsible for linking each Facebook page liked by some worker to the
corresponding entity in the LOD cloud. Given the page name and, possibly, a textual descrip-
tion of the page, the task is defined as identifying the correct URI among all the ones present
in the LOD graph using, for example, a similarity measure based on adjacent nodes in the
graph. This is a well-studied problem where both automatic [71] and crowdsourcing-based techniques [50] can be used.
5.2.5 Worker Profile Selector
HITs and workers are matched based on the profiles described above. Intuitively, a worker who only likes music bands will not be assigned a task asking him/her to identify the movie actor depicted in a displayed picture. The similarity measure used for matching
workers to tasks takes into account the entities included in the workers’ profiles but is also
based on the Facebook categories their liked pages belong to. For example, it is possible to
use the corresponding DBPedia entities and their YAGO type. The YAGO knowledge-base
provides a fine-grained high-accuracy entity type categorization which has been constructed
by combining Wikipedia category assignments with WordNet synset information. The YAGO
type hierarchy can help the system better understand which type of entity correlates with the
skills required to effectively complete a HIT (see also Section 5.3 for a formal definition of such
methods). For instance, our graph-based approach concludes that for our music related task,
the top Facebook pages that indicate expertise on the topic are ‘MTV’ and ‘Music & top artists’.
A generic similarity measure to match workers and tasks is defined as

\[
\mathit{sim}(w_j = \{P, T\}, h_i = \{t, d, A, Cat\}) = \frac{\sum_{k,l} \mathit{sim}(p_k, a_l)}{|P| \cdot |A|}, \quad \forall p_k \in P,\ a_l \in A \tag{5.3}
\]

where A is the set of candidate answers for task h_i and sim() measures the similarity between the worker profile and the task description.
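Equation 5.3 transcribes directly into code, assuming some pluggable page/answer similarity function sim() (e.g., a normalized string similarity); this is a sketch of the averaging scheme, not of any specific similarity measure:

```python
def sim_worker_task(worker_pages, answers, sim):
    """Average pairwise similarity between a worker's liked pages P and
    a HIT's candidate answers A, as in Equation 5.3.

    `sim` is any callable scoring a (page, answer) pair in [0, 1].
    """
    if not worker_pages or not answers:
        return 0.0
    total = sum(sim(p, a) for p in worker_pages for a in answers)
    return total / (len(worker_pages) * len(answers))

# With exact-match similarity, one matching pair out of four yields 0.25.
exact = lambda p, a: 1.0 if p == a else 0.0
score = sim_worker_task(["x", "y"], ["x", "z"], exact)
```

The normalization by |P| · |A| keeps scores comparable between workers with many and few liked pages.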
5.2.6 HIT Assigner and Facebook App
The HIT Assigner component takes as input the final HITs with the defined reward and
publishes them onto the Facebook App. We developed a dedicated, native Facebook App called
SocialBrain{r}3 to implement this final component of the Pick-A-Crowd platform. Figure 5.2
shows a few screenshots of SocialBrain{r}. As any other application on the Facebook platform,
it has access to several pieces of information about the users that accept to use it. We follow a
2 https://developers.facebook.com/docs/concepts/opengraph/
3 http://apps.facebook.com/socialbrainr/
Figure 5.2 – Screenshots of the SocialBrain{r} Facebook App. Above, the dashboard displaying HITs available to a specific worker. Below, a HIT about actor identification assigned to a worker who likes several actors.
non-intrusive approach: in our case, the pages liked by each user are stored in an external database that is used to create a worker profile containing his/her interests. The application we developed also adopts crowdsourcing incentive schemes different from the purely financial one. For example, we use the fan incentive, where a competition can be organized involving several workers competing on trivia questions about their favorite topic. The app also allows workers to directly challenge other social network contacts by sharing a task, which additionally helps enlarge the application's user base. While from the worker's point of view this represents a friendly challenge, from the platform's point of view this means that the HIT will be pushed to another expert worker, following the assumption that a worker would challenge someone who is also knowledgeable about the topic addressed by the task.
5.2.7 HIT Result Collector and Aggregator
The final pipeline is composed of stream processing modules, where answers from the Facebook App are streamed from the crowd to the answer creation pipeline. The first component collects the answers from the crowd and is responsible for a first quality check based on potentially available gold answers for a small set of training questions. Then, answers that are considered to be valid (based on available ground-truth data) are forwarded to the HIT Result Aggregator component, which collects and aggregates them into the final answer for the HIT.
When a given number of answers has been collected (e.g., five answers), the component outputs the partial aggregated answer (e.g., based on majority vote) back to the requester. As
more answers reach the aggregation component, the aggregated answer presented to the re-
quester gets updated. Additionally, as answers are collected, the workers’ profiles get updated
and the reward gets granted to the workers who performed the task through the Facebook
App.
5.3 HIT Assignment Models
In this section, we define the HIT assignment tasks and describe several approaches for
assigning workers to such tasks. We focus on HIT assignment rather than on other system
components as the ability to assign tasks automatically is the most original feature of our
system as compared to other crowdsourcing platforms.
Given a HIT h_i = {t_i, d_i, A_i, Cat_i} from the requester, the task of assigning it to some workers is defined as ranking all available workers C = {w_1, .., w_n} on the platform and selecting the top-n ranked workers. A HIT consists of a textual description t_i (e.g., the task instruction which is provided to the workers)4, a data field d_i that is used to provide the context of the task to the worker (e.g., the container for an image to be labelled), and, optionally, the set of candidate answers A_i = {a_1, .., a_n} for multiple-choice tasks (e.g., a list of music genres used to categorize a singer) and a list of target Facebook categories Cat_i = {c_1, .., c_n}. A worker profile w_j = {P, T} is assigned a score based on which it is ranked for the task h_i. This score is determined based on the likelihood of matching w_j to h_i. Thus, the goal is to define a scoring function match(w_j, h_i) based on the worker profile, the task description and, possibly, external resources such as the LOD datasets or a taxonomy.
5.3.1 Category-based Assignment Model
The first approach we define to assign HITs to workers is based on the same idea that Facebook
uses to target advertisements to its users. A requester has to select the target community of
users who should perform the task by means of selecting one or more Facebook pages or
page categories (in the same way as someone who wants to place an ad). Such categories are
4 When applied to hybrid human-machine systems, t_i can be defined as the data context of the HIT. For example, in crowdsourced databases, t_i can be the name of the column, table, etc. the HIT is about.
Figure 5.3 – An example of the Expert Finding Voting Model.
defined in a two-level structure with six top-level categories (e.g., “Entertainment”, “Company”), each of them having several sub-categories (e.g., “Movie”, “Book”, “Song”, etc. are sub-categories of “Entertainment”).
Once some second-level categories are selected by the requester, the platform can generate a ranking of users based on the pages they like. More formally, given a set of target categories Cat = {c_1, .., c_n} from the requester, we define P(c_i) = {p_1, .., p_n} as the set of pages belonging to category c_i. Then, for each worker w_j ∈ C, we take the set of pages he/she likes, P(w_j), and measure its intersection with the pages belonging to any category selected by the requester, RelP = ∪_i P(c_i). Thus, we can assign a score to each worker based on the overlap between his/her likes and the target categories, |P(w_j) ∩ RelP|, and rank all w_j ∈ C based on such scores.
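The category-based scoring above can be sketched in a few lines; the worker, page, and category identifiers below are hypothetical:

```python
def rank_by_category(workers_likes, pages_by_category, target_categories):
    """Rank workers by the overlap |P(w_j) ∩ RelP| between the pages
    they like and the pages of the requester's target categories.

    `workers_likes` maps worker id -> set of liked page ids;
    `pages_by_category` maps category -> set of page ids.
    """
    relevant = set()  # RelP, the union of pages over target categories
    for c in target_categories:
        relevant |= pages_by_category.get(c, set())
    scores = {w: len(likes & relevant) for w, likes in workers_likes.items()}
    return sorted(scores, key=scores.get, reverse=True)

order = rank_by_category({"w1": {"p1", "p2"}, "w2": {"p3"}},
                         {"Movie": {"p1", "p3"}, "Book": {"p2"}},
                         ["Movie", "Book"])
```

Worker w1, who likes two pages in the targeted categories, is ranked above w2, who likes only one.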
5.3.2 Expert Profiling Assignment Model
A second approach we propose to rank workers given a HIT h_i is to follow an expert finding approach. Specifically, we define a scoring function based on the Voting Model for expert finding [110]. For the HIT we want to assign, we take the set of its candidate answers A_i, when available. Then, we define a disjunctive keyword query based on all the terms composing the answers, q = ∨_i a_i. In case A_i is not available, for example because the task asks an open-ended question, then q can be extracted out of t_i by mining entities mentioned in its content. The query q is then used to rank Facebook pages using an inverted index built over the collection of pages ∪_j P(w_j), ∀ w_j ∈ C. We consider each ranked page as a vote for the workers who like it on Facebook and rank workers accordingly. That is, if RetrP is the set of pages retrieved with q, we can define a worker ranking function as |P(w_j) ∩ RetrP|. More interestingly, we can take into account the ranking generated by q and give a higher score to workers liking pages that were ranked higher. An example of how to rank workers following the voting model is depicted in Figure 5.3.
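A rank-weighted variant of this voting model can be sketched as follows, using 1/rank as the vote strength (our choice for illustration; other rank discounts are possible):

```python
def voting_model_rank(retrieved_pages, workers_likes):
    """Rank workers by rank-weighted votes from retrieved pages.

    `retrieved_pages` is the page ranking produced by query q (best
    first); each page casts a vote of strength 1/rank for every worker
    who likes it. `workers_likes` maps worker id -> set of liked pages.
    """
    scores = {w: 0.0 for w in workers_likes}
    for rank, page in enumerate(retrieved_pages, start=1):
        for w, likes in workers_likes.items():
            if page in likes:
                scores[w] += 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)

# The worker liking the top-ranked page outranks the one liking rank 2.
order = voting_model_rank(["mtv", "rollingstone"],
                          {"w1": {"rollingstone"}, "w2": {"mtv"}})
```

Setting the vote strength to a constant 1 instead of 1/rank recovers the simpler unweighted ranking |P(w_j) ∩ RetrP|.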
5.3.3 Semantic-Based Assignment Model
The third approach we propose is based on third-party information. Specifically, we first link
candidate answers and pages to an external knowledge base (e.g., DBPedia) and exploit its
structure to better assign HITs to workers. For a given HIT h_i, the first step is to identify the entity corresponding to each a_j ∈ A_i (if A_i is not available, entities in t_i can be used instead).
This task is related to entity linking [50] and ad-hoc object retrieval [133, 151] where the goal is
to find the correct URI for a description of the entity using keywords. In this work, we take
advantage of state-of-the-art techniques for this task but do not focus on improving over
such techniques. Then, we identify the entity that represents each page liked by the crowd
whenever it exists in the knowledge base. Once both answers and pages are linked to their
corresponding entity in the knowledge base, we exploit the underlying graph structure to
determine the extent to which entities that describe the HIT and entities that describe the
interests of the worker are similar. Specifically, we define two scoring methods based on the
graph.
The first scoring method takes into account the vicinity of the entities in the entity graph.
We measure how many worker entities are directly connected to HIT entities using SPARQL
queries over the knowledge base as follows:
SELECT ?x
WHERE { <uri(a_i)> ?x <uri(p_i)> }
This follows the assumption that a worker who likes a page is able to answer questions about
related entities. For example, if a worker likes the page ‘FC Barcelona’, then he/she might be a
good candidate worker to answer a question about ‘Lionel Messi’ who is a player of the soccer
team liked by the worker.
Our second scoring function is based on the type of entities. We measure how many worker
entities have the same type as the HIT entity using SPARQL queries over the knowledge base
as follows:
SELECT ?x
WHERE { <uri(a_i)> <rdf:type> ?x .
        <uri(p_i)> <rdf:type> ?x }
The underlying assumption in that case is that a worker who likes a page is able to answer
questions about entities of the same type. For example, if a worker likes the pages ‘Tom Hanks’
and ‘Julia Roberts’, then he/she might be a good candidate worker to answer a question about
‘Meg Ryan’ as it is another entity of the same type (i.e., actor).
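Both scoring functions can be mimicked over an in-memory set of triples, which keeps the sketch self-contained instead of querying a live SPARQL endpoint (the entity names and triples below are illustrative, not taken from an actual knowledge base):

```python
def direct_link_score(triples, answer_uri, liked_uris):
    """Count liked entities directly connected to the answer entity,
    mirroring the first SPARQL query (any predicate counts)."""
    return sum(1 for (s, _p, o) in triples
               if s == answer_uri and o in liked_uris)

def shared_type_score(triples, answer_uri, liked_uris):
    """Count liked entities sharing an rdf:type with the answer entity,
    mirroring the second SPARQL query."""
    def types_of(e):
        return {o for (s, p, o) in triples if s == e and p == "rdf:type"}
    answer_types = types_of(answer_uri)
    return sum(1 for e in liked_uris if types_of(e) & answer_types)

triples = [("Lionel_Messi", "playsFor", "FC_Barcelona"),
           ("Lionel_Messi", "rdf:type", "SoccerPlayer"),
           ("Meg_Ryan", "rdf:type", "Actor"),
           ("Tom_Hanks", "rdf:type", "Actor")]
```

On this toy graph, a worker liking ‘FC Barcelona’ scores on the vicinity measure for a question about ‘Lionel Messi’, and a worker liking ‘Tom Hanks’ scores on the type measure for a question about ‘Meg Ryan’, matching the two examples in the text.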
5.4 Experimental Evaluation
Given that the main innovation of Pick-A-Crowd as compared to classic crowdsourcing plat-
forms such as AMT is the ability to push HITs to workers instead of letting the workers select
the HITs they wish to work on, we focus in the following on the evaluation and comparison of
different HIT assignment techniques and compare them in terms of work quality against a
classic crowdsourcing platform.
5.4.1 Experimental Setting
The Facebook app SocialBrain{r} we have implemented within the Pick-A-Crowd framework currently counts more than 170 workers, who perform HITs that require labeling images containing popular or less popular entities and answering open-ended questions. Overall, more than 12K distinct Facebook pages liked by the workers have been crawled over the Facebook Open Graph. SocialBrain{r} is implemented using a cloud-based storage and processing back-end to ensure scalability with an increasing number of workers and requesters. SocialBrain{r} workers have been recruited via AMT, thus making a direct experimental comparison to standard AMT techniques more meaningful.
The task categories we evaluate our approaches on are: actors, soccer players, anime characters, movie actors, movie scenes, music bands, and questions related to cricket. Our experiments cover both multiple-answer questions and open-ended questions: each task category includes 50 images, for which the worker has to select the right answer among 5 candidate answers, or 20 open-ended questions related to the topic. Each question can be skipped by the worker in case he/she has no idea about that particular topic.
In order to analyze the performance of workers in the crowd, we measure the Precision, Recall (as the worker is allowed to skip questions when he/she does not know the answer), and Accuracy of their answers for each HIT, obtained via majority vote over 3 and 5 workers.5
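These skip-aware metrics can be sketched as follows, a minimal Python illustration of the assumed semantics: Precision over answered questions, Recall over all questions, and Accuracy computed on the majority vote:

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate one question's answers from several workers; None = skipped."""
    voted = [a for a in answers if a is not None]
    return Counter(voted).most_common(1)[0][0] if voted else None

def precision_recall(worker_answers, gold):
    """Precision counts only answered questions; Recall penalizes skips."""
    answered = [(a, g) for a, g in zip(worker_answers, gold) if a is not None]
    correct = sum(1 for a, g in answered if a == g)
    precision = correct / len(answered) if answered else 0.0
    recall = correct / len(gold)
    return precision, recall

def batch_accuracy(all_answers, gold):
    """Accuracy of the majority-voted answers over a batch of questions."""
    votes = [majority_vote(col) for col in zip(*all_answers)]
    return sum(1 for v, g in zip(votes, gold) if v == g) / len(gold)
```

A worker who skips a question keeps full Precision but loses Recall, matching the definition above.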
5.4.2 Motivation Examples
As we can see from Figure 5.4, the HIT that asks questions about cricket clearly shows how workers can perform differently in terms of accuracy. There are 13 workers out of 35 who were not able to provide any correct answer, while the others spread over the Precision/Recall spectrum, with the best worker performing at 0.9 Precision and 0.9 Recall. This example motivates the need to selectively assign the HIT to the most appropriate worker rather than following a first-come-first-served approach as proposed, for example, by AMT. Thus, the goal
of Pick-A-Crowd is to adopt HIT assignment models that are able to identify the workers in
5The set of HITs and correct answers we used in our experiments is available for comparative studies online at: http://exascale.info/PickACrowd
Figure 5.4 – Crowd performance on the cricket task. Square points indicate the 5 workers selected by our graph-based model that exploits entity type information.
Figure 5.5 – Crowd performance on the movie scene recognition task as compared to movie popularity.
the top-right area of Figure 5.4, based solely on their social network profile. As an anecdotal observation, a worker from AMT left the following feedback in the comment field of the cricket task: “I had no idea what to answer to most questions...”, which clearly demonstrates that, for tasks requiring background knowledge, not all workers are a good fit.
An interesting observation is the impact of the popularity of a question. Figure 5.5 shows the
correlation between task accuracy on the movie scene recognition task and the popularity of
the movie based on the overall number of Facebook likes on the IMDB movie page. We can
observe that when a movie is popular, then workers easily recognize it. On the other hand,
when a movie is not so popular it becomes more difficult to find knowledgeable workers for
the task.
Figure 5.6 – SocialBrain{r} Crowd age distribution.
Figure 5.7 – SocialBrain{r} Notification click rate.
5.4.3 SocialBrain{r} Crowd Analysis
Figure 5.6 shows some statistics about the user base of SocialBrain{r}. The majority of workers
are in the age interval 25-34 and are from the United States.
Another interesting observation can be made about the Facebook Notification click rate.
Once the Pick-A-Crowd system selects a worker for a HIT, the Facebook app SocialBrain{r}
sends a notification to the worker with information about the newly available task and its
reward. Figure 5.7 shows a snapshot of the notifications clicked by workers as compared to the
notifications sent by SocialBrain{r} over a few days. We observe an average click rate of 57% per notification sent.
A third analysis looks at how the relevant likes of a worker correlate with his/her accuracy for the task. Figure 5.8 shows the distribution of worker accuracy over the relevant pages liked, using the category-based HIT assignment model to define the relevance of pages. At first glance, we do not see a perfect correlation between the number of likes and the worker accuracy for any task. On the other hand, we observe that when many relevant pages are in the worker profile (e.g., >30), then accuracy tends to be high (i.e., the bottom-right part of the plot is empty). However, when only a few relevant pages belong to the worker profile, it becomes difficult to predict his/her accuracy. Note that not liking relevant pages is not an indication of being unsuitable for a task: Having an incomplete profile simply does not allow us to model the worker and to assign him/her the right tasks (i.e., the top-left part of the plot contains high-accuracy workers with incomplete profiles). The scarcity of worker profiles containing several relevant pages is not problematic when the crowd is large enough (as it is on Facebook).
Figure 5.8 – SocialBrain{r} Crowd Accuracy as compared to the number of relevant Pages a worker likes.
Task           AMT 3   AMT 5   AMT Masters 3
Soccer         0.8     0.8     0.1
Actors         0.82    0.82    0.9
Music          0.76    0.7     0.7
Book Authors   0.7     0.5     0.58
Movies         0.6     0.64    0.66
Anime          0.94    0.86    0.1
Cricket        0.004   0       0.72

Table 5.1 – A comparison of the task accuracy for the AMT HIT assignment model assigning each HIT to the first 3 and 5 workers and to AMT Masters.
5.4.4 Evaluation of HIT Assignment Models
In the literature, common crowdsourcing tasks usually adopt 3 or 5 assignments of the same
HIT in order to aggregate the answers from the crowd, for example by majority vote. In the
following, we compare different assignment models evaluating both the cases where 3 and
5 assignments are considered for a given HIT. As a baseline, we compare against the AMT
model that assigns the HIT to the first n workers performing the task. We also compare against
AMT Masters, who are workers awarded a special status by Amazon based on their past performance.6 Our proposed models first rank workers in the crowd based on their estimated accuracy and then assign the task to the top-3 or top-5 workers.
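This shared top-k assignment step can be sketched as follows (illustrative names; any of the models evaluated below can supply the per-worker scores):

```python
# Sketch: route a HIT to the k highest-scoring workers (k = 3 or 5), replacing
# AMT's first-come-first-served allocation. Scores come from any assignment
# model (category-based, voting model, graph-based, ...).
def assign_hit(scores, k=3):
    """scores: worker id -> estimated accuracy; returns the top-k worker ids."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

top3 = assign_hit({"w1": 0.2, "w2": 0.9, "w3": 0.5, "w4": 0.7}, k=3)
```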
Table 5.1 presents an overview of the performance of the assignment model used by AMT. We
observe that while on average there is not a significant difference between using 3 or 5 workers,
Masters perform better than the rest of the AMT crowd on some tasks but do not outperform
the crowd on average (0.54 versus 0.66 Accuracy). A per-task analysis shows that some tasks
are easier than others: While tasks about identifying pictures of popular actors obtain high
accuracy for all three experiments, topic-specific tasks such as cricket questions may lead to a
very low accuracy.
6Note that to be able to recruit enough Masters for our tasks we had to reward $1.00 per task as compared to $0.25 granted to standard workers.
Task           Requester Selected Categories                       Category-based 3   Category-based 5
Soccer         Sport, Athlete, Public figure                       0.94               0.98
Actors         Tv show, Comedian, Movie, Artist, Actor/director    0.94               0.96
Music          Musician/band, Music                                0.96               0.96
Book Authors   Author, Writer, Book                                0.98               0.94
Movies         Movie, Movie general, Movies/music                  0.44               0.74
Anime          Games/toys, Entertainment                           0.62               0.7
Cricket        Sport, Athlete, Public figure                       0.63               0.54

Table 5.2 – A comparison of the effectiveness for the category-based HIT assignment models assigning each HIT to 3 and 5 workers with manually selected categories.
Task           Voting Model q=t_i 3   Voting Model q=t_i 5   Voting Model q=A_i 3   Voting Model q=A_i 5
Soccer         0.92                   0.92                   0.86                   0.86
Actors         0.92                   0.94                   0.92                   0.88
Music          0.96                   0.96                   0.76                   0.78
Book Authors   0.94                   0.96                   0.3                    0.84
Movies         0.70                   0.60                   0.70                   0.42
Anime          0.54                   0.84                   0.56                   0.54
Cricket        0.63                   0.72                   0.72                   0.72

Table 5.3 – Effectiveness for different HIT assignments based on the Voting Model assigning each HIT to 3 and 5 workers and querying the Facebook Page index with the task description q = t_i and with the candidate answers q = A_i respectively.
Table 5.2 gives the results we obtained by assigning tasks based on the Facebook Open Graph categories manually selected by the requester. We observe that the Soccer and Cricket tasks have been assigned to the same Facebook categories, which do not distinguish between different types of sports. Nevertheless, we can see that for the cricket task the category-based method does not perform well, as the pages contained in the categories cover many different sports and, according to our crowd at least, soccer-related tasks are simpler than cricket-related tasks.
Table 5.3 presents the results when assigning HITs following the Voting Model for expert finding. We observe that, in the majority of cases, assigning each task to 5 different workers selected by querying the Facebook Page index with the task description leads to the best results.
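The vote-counting idea can be sketched as follows, under an assumed data layout where each page retrieved for the query counts as one vote for every worker who likes it:

```python
# Sketch of the Voting Model (hypothetical data layout): pages retrieved for
# the query q (the task description t_i or the candidate answers A_i) act as
# votes for the expertise of the workers who like them.
def voting_model_scores(retrieved_pages, likes):
    """likes: worker id -> set of liked page ids; returns vote counts."""
    retrieved = set(retrieved_pages)
    return {w: len(pages & retrieved) for w, pages in likes.items()}

scores = voting_model_scores(
    ["p_cricket", "p_ipl"],
    {"w1": {"p_cricket", "p_ipl"}, "w2": {"p_soccer"}},
)
```

The resulting scores can then be fed to the top-k assignment step described earlier.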
Table 5.4 shows the results of our graph-based approaches. We observe that in the majority
of these cases, the graph-based approach that follows the entity type (“En. type”) edges and
selects workers who like Pages of the same type as the entities involved in the HIT outperforms
the approach that considers the directly-related entities within one step in the graph (“1-step”).
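A minimal sketch of the entity-type ("En. type") scoring, assuming worker profiles have already been reduced to sets of rdf:type values over their liked pages:

```python
# Sketch (assumed pre-processing): each worker is represented by the set of
# rdf:type values of the entities behind his/her liked pages; a worker scores
# one point per type shared with the entities appearing in the HIT.
def entity_type_rank(worker_types, hit_types, k=3):
    scores = {w: len(t & hit_types) for w, t in worker_types.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

ranked = entity_type_rank(
    {"w1": {"dbo:Actor"}, "w2": {"dbo:City"}, "w3": {"dbo:Actor", "dbo:Film"}},
    {"dbo:Actor", "dbo:Film"},
    k=2,
)
```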
5.4.5 Comparison of HIT Assignment Models
Table 5.5 presents the average Accuracy obtained over all the HITs in our experiments (which
makes a total of 320 questions) by each HIT assignment model. As we can see, our proposed
Task           En. type 3   En. type 5   1-step 3   1-step 5
Soccer         0.98         0.92         0.86       0.86
Actors         0.92         0.92         0.92       0.90
Music          0.62         0.68         0.64       0.54
Book Authors   0.28         0.50         0.50       0.82
Movies         0.70         0.78         0.46       0.62
Anime          0.46         0.90         0.62       0.62
Cricket        0.63         0.82         0.63       0.63

Table 5.4 – Effectiveness for different HIT assignments based on the entity graph in the DBPedia knowledge base assigning each HIT to 3 and 5 workers.
Assignment Method      Average Accuracy
AMT 3                  0.66
AMT 5                  0.62
AMT Masters 3          0.54
Category-based 3       0.79
Category-based 5       0.83
Voting Model t_i 3     0.80
Voting Model t_i 5     0.85
Voting Model A_i 3     0.69
Voting Model A_i 5     0.72
En. type 3             0.66
En. type 5             0.79
1-step 3               0.66
1-step 5               0.71

Table 5.5 – Average Accuracy for different HIT assignment models assigning each HIT to 3 and 5 workers.
HIT assignment models outperform the standard first-come-first-served model adopted by
classic crowdsourcing platforms such as AMT. On average over the evaluated tasks, the best
performing model is the one based on the Voting Model defined for the expert finding problem
where pages relevant to the task are seen as votes for the expertise of the workers. Such an
approach obtains on average a 29% relative improvement over the best accuracy obtained by
the AMT model.
5.5 Related Work in Task Routing
5.5.1 Crowdsourcing over Social Networks
A first attempt to crowdsource micro-tasks on top of social networks was proposed in [54], where the authors describe a framework to post questions as tweets that users can solve by
tweeting back an answer. As compared to this early approach, we propose a more controlled
environment where workers are known and profiled in order to push tasks to selected users.
Crowdsourcing over social networks is also used by CrowdSearcher [31, 32, 33], which im-
proves automatic search systems by means of asking questions to personal contacts. The
crowdsourcing architecture proposed in [34] considers the problem of assigning tasks to selected workers. However, the authors do not evaluate automatic assignment approaches but only let the requesters manually select the individual workers to whom they want to push the task.
Instead, in this work, we assess the feasibility and effectiveness of automatically mapping HITs
to workers based on their social network profiles.
Also related to our system is the study of trust in social networks. Golbeck [68], for instance,
proposes different models to rank social network users based on trust and applies them to
recommender systems as well as other end-user applications.
5.5.2 Task Recommendation
Assigning HITs to workers is similar to the task performed by recommender systems (e.g.,
recommending movies to potential customers). We can categorize recommender systems
into content-based and collaborative filtering approaches. The former exploit the content of the resources and match it to user interests. The latter only use the similarity between user profiles constructed from the users' interests (see [130] for a survey). Recommended resources are those already consumed by similar users. Our system adopts techniques from
the field of recommender systems as it aims at matching HITs (i.e., tasks) to human workers
(i.e., users) by constructing profiles that describe worker interests and skills. Such profiles are
then matched to HIT descriptions that are either provided by the task requester or by analyzing
the questions and potential answers included in the task itself (see Section 5.3). Recommender
systems built on top of social networks already exist. For example, in [9], the authors propose a news recommendation system for social network groups based on community descriptions.
5.5.3 Expert Finding
In order to push tasks to the right worker in the crowd, our system aims at identifying the most
suitable person for a given task. To do so, our Worker Profile Selector component generates
a ranking of candidate workers who can be contacted for the HIT. This is highly related to
the task of Expert Finding studied in Information Retrieval. The Enterprise track at the TREC
evaluation initiative7 has constructed evaluation collections for the task of expert finding
within an organizational setting [20]. The studied task is that of ranking candidate experts
(i.e., employees of a company) given a keyword query describing the required expertise. Many
approaches have been proposed for such tasks (see [18] for a comprehensive survey). We can
classify most of them as either document-based, when document ranking is performed before
7http://trec.nist.gov
identifying the experts, or as candidate-based, when expert profiles are first constructed before
being ranked given a query. Our system follows the former approach by ranking online social
network pages and using them to assign work to the best matching person.
5.6 Conclusions
A simplistic task allocation procedure, such as pull crowdsourcing, is suboptimal when it comes to efficiently leveraging individual workers' skills and points of interest to obtain high-quality answers. For this reason, we proposed Pick-A-Crowd, a novel crowdsourcing scheme
focusing on pushing tasks to the right worker rather than letting the workers spend time
finding tasks that suit them. We described a novel crowdsourcing architecture that builds
worker profiles based on their online social network activities and tries to understand the
skills and interests of each worker. Thanks to such profiles, Pick-A-Crowd can assign each task
to the right worker dynamically.
To demonstrate and evaluate our proposed architecture, we have developed and deployed SocialBrain{r}, a native Facebook application that pushes crowdsourced tasks to selected workers
and collects the resulting answers. We additionally proposed and extensively evaluated HIT
assignment models based on 1) Facebook categories manually selected by the task requester,
2) methods adapted from an expert finding scenario in an enterprise setting, and 3) methods
based on graph structures borrowed from external knowledge bases. Experimental results
over the SocialBrain{r} user-base show that all of the proposed models outperform the classic
first-come-first-served approach used by standard crowdsourcing platforms such as Amazon
Mechanical Turk. Our best approach provides on average 29% better results than the AMT model.
A potential limitation of our approach is that it may lead to longer task completion times: While on pull crowdsourcing platforms tasks get completed quickly (since any available worker can perform them), following a push methodology may lead to delays in the completion of
the tasks. In the next chapters, we will investigate techniques that will allow us to improve the
efficiency of a crowdsourcing campaign for both pull and push crowdsourcing.
6 Human Intelligence Task Retention
6.1 Introduction
We now turn our attention to improving the execution time of crowdsourcing tasks, whose timely completion is hardly guaranteed: many factors influence the progression pace, including crowd availability, the time of day [136, 76], the amount of the micro-payments [62], the number of remaining tasks in a given batch, concurrent campaigns, and the reputation of the publisher [80]. A common observation when running a crowdsourcing
campaign on micro-task crowdsourcing platforms is the long-tail distribution of work done
by people [64, 50, 76]: Many workers complete just one or a few HITs while a small number
of workers do most of the HITs in a batch (see Figure 6.1). While this distribution has been
repeatedly observed in a variety of settings, we argue in the following that it is hardly the
optimal case from a batch latency point of view.
As shown in previous work [62], long batches of Human Intelligence Tasks (HITs) submitted to crowdsourcing platforms tend to attract more workers than shorter batches. As a consequence of the long-tail distribution of work, however, long batches tend to attract fewer workers towards their end (that is, when only a few HITs are left), as fewer workers are willing to engage with an almost-completed batch. In this case, it is particularly important
that current workers continue to do as much work as possible before they drop out and prompt
the hiring of new workers for the remaining HITs. In addition, when workers become scarce
(e.g., when the demand is high), such turnovers can become a serious obstacle to rapid batch
completion.
In this chapter, we will explore worker retention as a technique that can be used to improve
batch execution time. For that we introduce a set of pricing schemes designed to improve the
retention rate of workers working in long batches of similar tasks. We show how increasing or
decreasing the monetary reward over time influences the number of tasks a worker is willing
to complete in a batch, as well as how it influences the overall latency. We compare our new
pricing schemes against traditional pricing methods (e.g., constant reward for all the HITs
in a batch) and empirically show how certain schemes effectively function as an incentive
[Figure: two panels labeled Scale-up and Scale-out; x-axis: Worker ID; y-axis: Number of Tasks Submitted.]
Figure 6.1 – The classic distribution of work in crowdsourced tasks follows a long-tail distribution where few workers complete most of the work while many workers complete just one or two HITs.
for workers to keep working longer on a given batch of HITs. Our experimental results show
that the best pricing scheme in terms of worker retention is based on punctual bonuses paid
whenever the workers reach predefined milestones.
In summary, the main contributions presented in this chapter are:
• A novel crowdsourcing optimization problem focusing on retaining workers longer in
order to minimize the execution time of long batches.
• A set of new incentive schemes focusing on making individual workers more engaged
with a given batch of HITs in order to improve worker retention rates.
• An open-source software library to embed the proposed schemes inside current HIT
interfaces1.
• An extensive experimental evaluation of our new techniques over different tasks on a
state-of-the-art crowdsourcing platform.
The rest of the chapter is structured as follows: In Section 6.2, we formally define the problem and introduce different pricing schemes to retain workers longer on a set of HITs given a fixed monetary budget. Section 6.3.2 presents empirical results comparing the efficiency of our different pricing schemes and discusses their effect on crowd retention and overall latency, followed by a discussion in Section 6.4. Finally, we review related work on pricing schemes for crowdsourcing platforms and their effects on the behavioral patterns of the workers, before concluding in Section 6.6.
1A library based on the Django framework available at: https://github.com/XI-lab/BonusBar
6.2 Worker Retention Schemes
Our main retention incentive is based on compensating workers engaged in a batch of tasks using monetary bonuses and qualifications. We start this section by formally characterizing
our problem below. We then introduce our various pricing schemes, before describing the
visual interface we implemented in order to inform the workers of the monetary rewards, and
the different types of tasks we considered for the HITs.
6.2.1 Problem Definition
Given a fixed retention budget B allocated to pay workers w1, . . . , wm to complete a batch of n analogous tasks H = {h1, . . . ,hn}, our task is to allocate B over the various HITs in the batch in order to maximize the average number of tasks completed by the workers. More formally, our goal is to come up with a function b(h) which, for each h_j ∈ H, gives us the optimal reward upon completion of h_j, so as to maximize the average number of tasks completed by the workers, i.e.:
b(h)_{\mathrm{opt}} = \operatorname*{arg\,max}_{b(h)} \; \frac{1}{m\,n} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \mathbb{1}_C(w_i, h_j, b(h_j)) \qquad (6.1)
where 1C (wi ,h j ,b(h j )) is an indicator function equal to 1 when worker wi completed task h j
under rewarding regime b(h), and to 0 otherwise. For simplicity, we assume in the following
that workers complete their HITs sequentially, i.e., that ∀ h_i, h_j ∈ H, h_i is submitted before h_j if i < j, though they can drop out at any point in time in the batch.
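Given a completion log, the objective of Equation 6.1 can be evaluated directly; a minimal sketch over a hypothetical log format:

```python
# Sketch: evaluate the objective of Eq. 6.1 from a completion log, i.e., the
# average fraction of the n HITs completed per worker under a pricing b(h).
def avg_completion(completed, m, n):
    """completed: set of (worker index i, task index j) pairs where 1_C = 1."""
    return len(completed) / (m * n)

# Two workers, three HITs: w0 finishes all three, w1 finishes only the first.
val = avg_completion({(0, 0), (0, 1), (0, 2), (1, 0)}, m=2, n=3)
```

Comparing this value across pricing schemes is exactly the comparison carried out in the experiments below.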
6.2.2 Pricing Functions
Fixed Bonus. The standard pricing scheme used in micro-task crowdsourcing platforms like
AMT is uniform pricing. Under this regime, the worker receives the same monetary bonus for
each completed task in the batch:
b(h_i) = \frac{B}{|H|} \quad \forall\, h_i \in H \qquad (6.2)
Training Bonus. Instead of paying the same bonus for each task in a batch, one might try to
overpay workers at the beginning of the batch in order to make sure that they do not drop out
early as is often the case. This scheme is especially appealing for more complex tasks requiring
the workers to learn some new skill initially, making the first HITs less appealing due to the
initial overhead. This scheme allows the requester to compensate the implicit training phase
by initially fixing a high hourly wage despite the low productivity of the worker. Many different
reward functions can be defined to achieve this goal. In our context, we propose a linearly
decreasing pricing scheme as follows:
b(h_i) = \frac{B}{|H|} + \left( \left\lceil \frac{|H|}{2} \right\rceil - i \right) \cdot \frac{B}{2\,|H|^2} \qquad (6.3)
where we add to the average HIT reward B/|H| a bonus payment increment (i.e., the last term of the equation) a number of times that depends on the position of the current HIT in the batch.
The general idea behind this scheme is to distribute the available budget in a way that HITs
are more rewarded at the beginning and such that the bonus incrementally decreases after
that. One potential advantage of this pricing scheme is the possibility to attract many workers
to the batch due to the initial high pay. On the other hand, retention may not be optimal since
workers could drop out as soon as the bonus gets too low.
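A sketch of Equations 6.2 and 6.3 follows (0-indexed HITs; the decrement B/(2|H|^2) reflects our reading of the printed Eq. 6.3):

```python
import math

def fixed_bonus(B, n, i):
    # Eq. 6.2: uniform share of the bonus budget B over the n HITs
    return B / n

def training_bonus(B, n, i):
    # Eq. 6.3 (0-indexed HIT i): linearly decreasing scheme whose increment
    # is B / (2 n^2); HITs before the batch midpoint are overpaid.
    return B / n + (math.ceil(n / 2) - i) * B / (2 * n * n)
```

For a $1.00 budget over 10 HITs, the first HIT pays $0.125 and the last $0.08, versus a uniform $0.10.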
Increasing Bonus. By flipping the (+) sign in Equation 6.3 into a (−), we obtain the opposite
effect, that is, a pricing scheme with increasing reward over the batch length. That way, the
requesters are overpaying workers towards the end of the batch instead of at the beginning.
This approach potentially has two advantages: First, as workers get increasingly paid as they
complete more HITs in the batch, they might be motivated to continue longer in order to
complete the most rewarding HITs at the very end of the batch. Second, workers get rewarded
for becoming increasingly trained in the type of task present in the batch. On the other hand,
a possible drawback of this scheme is the fairly low initial appeal of the batch due to the low
bonuses granted at first.
Milestone Bonus. In all the previous schemes, bonuses are attributed after each completed
HIT. However, depending on the budget and the exact bonus function used, the absolute
value of the increments can be very small. To generate bigger bonuses, one could instead
try to accumulate increments over several HITs and distribute bonuses only occasionally.
Following this intuition, we introduce in the following the notion of milestone bonuses. Under
this regime, an accumulated bonus is rewarded punctually after completing a specific number
of tasks. For a fixed interval I, I ≤ n, we formulate this scheme using the following function:

b(h_i) = \begin{cases} \left\lceil \frac{B \cdot I}{|H|} \right\rceil & \text{if } i \bmod I = 0 \\ 0 & \text{otherwise} \end{cases} \qquad (6.4)
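Equation 6.4 can be sketched as follows, with the budget expressed in cents so that the ceiling is meaningful (an assumption on units) and the completed-HIT count i taken 1-indexed:

```python
import math

def milestone_bonus(B_cents, n, I, i):
    # Eq. 6.4: pay the accumulated share ceil(B*I/|H|) every I-th completed
    # HIT, and nothing in between; i is the 1-indexed completed-HIT count.
    return math.ceil(B_cents * I / n) if i % I == 0 else 0

# $5.00 over 50 HITs with milestones every 10 HITs: five payouts of $1.00.
payouts = [milestone_bonus(500, 50, 10, i) for i in range(1, 51)]
```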
Qualifications. In addition to the monetary reward that is offered at each interval, the re-
quester can define a qualification level that can be granted after each milestone. Qualifications
are a powerful incentive as they constitute a promise of exclusivity for future work.
Figure 6.2 – Screenshot of the Bonus Bar used to show workers their current and total reward.
Figure 6.3 – Screenshot of the Bonus Bar with next milestone and bonus.
Random Bonus. An additional scheme that we consider is the attribution of a bonus drawn at random2, without replacement, from a predefined distribution of the total retention budget B. In particular, we consider the Zipf distribution in order to create a lottery effect, so that a worker can get a high bonus at any point while progressing through the batch.
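The lottery mechanism, including the without-replacement rule of footnote 2, can be sketched as follows; the Zipf exponent s = 1 is an assumption:

```python
import random

def zipf_bonus_pool(B_cents, n, s=1.0):
    """Pre-generate n Zipf-shaped bonus values (in cents) summing to B_cents."""
    weights = [1 / (k ** s) for k in range(1, n + 1)]
    total = sum(weights)
    vals = [int(B_cents * w / total) for w in weights]
    vals[0] += B_cents - sum(vals)  # absorb rounding into the largest bonus
    return vals

pool = zipf_bonus_pool(500, 10)
# Draw without replacement: the drawn value leaves the pool, so the sum of
# all bonuses ever paid cannot exceed the budget (footnote 2).
bonus = pool.pop(random.randrange(len(pool)))
```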
6.2.3 Visual Reward Clues
On current micro-task crowdsourcing platforms such as Amazon Mechanical Turk, one can
implement the above rewarding schemes by allocating bonuses for the HITs. Hence, workers
complete HITs in a batch in exchange for the usual fixed reward, but get a bonus that possibly
varies from one HIT to another. In order to make this scheme clear to the workers, we decided
to augment the HIT interface with a Bonus Bar, an open-source toolkit that requesters can
easily integrate with their HITs3. Figure 6.2 gives a visual rendering of the payment information
displayed to a worker completing one of our HITs.
6.2.4 Pricing Schemes for Different Task Types
We hypothesize that the pricing schemes proposed above perform differently based on the task
at hand. In that sense, we decided to address three very different types of tasks and to identify
the most appropriate pricing scheme for each type in order to maximize worker retention.
The first distinction we make for the tasks is based on their length: Hence, we differentiate short tasks that only require a few seconds each (e.g., matching products) from longer tasks that require one minute or more (e.g., searching the Web for a customer service phone number).
Note that in any case we only consider micro-tasks, that is, tasks requiring little effort to be
completed by individuals and that can be handled in large batches.
The second distinction we make is based on whether or not the task requires some sort of initial training. The example we decided to pick for this work is the classification of butterfly images into a predefined set of classes. We assume that at the beginning of the batch the worker is not
confident in performing the task and repeatedly needs to check the corresponding Wikipedia
2The attributed bonus value is removed from the distribution's list to ensure that the budget limit is met.
3Specifically, AMT requesters can use the toolkit by means of the ExternalQuestion data structure: That is, an externally hosted Web form which is embedded in the AMT webpages.
[Figure: three panels (Butterfly Classification, Customer Service Phone, Item Matching); x-axis: Worker ID; y-axis: #Tasks Submitted; schemes: Fixed, Training, Increasing, Milestone, and Random Bonus.]
Figure 6.4 – Effect of different bonus pricing schemes on worker retention over three different HIT types. Workers are ordered by the number of completed HITs.
Batch Type                          #Workers   #HITs   Base Budget   Bonus Budget   Avg. HIT Time   Avg. Hourly Rate
Item Matching                       50         50      $0.5          $0.5           22 sec          $5.3/hr
Butterfly Classification            50         50      $0.5          $0.5           15 sec          $9.4/hr
Customer Care Phone Number Search   50         20      $0.2          $0.4           78 sec          $2.2/hr

Table 6.1 – Statistics for the three different HIT types.
pages in order to correctly categorize the various butterflies. After a few tasks, however, most workers will have assimilated the key differentiating features of the butterflies and will be able
to perform the subsequent tasks much more efficiently. For such tasks, we expect the training
bonus scheme to be particularly effective since it overpays the worker at the beginning of the
batch as he/she is spending a considerable amount of time to complete each HIT. After the
worker gets trained, one can probably lower the bonuses while still maintaining the same
hourly reward rate.
6.3 Experimental Evaluation
6.3.1 Experimental Setup
In order to experimentally compare the different pricing schemes we introduced above, we
consider three very different tasks:
• Item Matching: Our first batch is a standard dataset of HITs (already used in [164]) asking workers to uniquely identify products that can be referred to by different names (e.g., ‘iPad Two’ and ‘iPad 2nd Generation’).
• Butterfly Classification: This is a collection of 619 images of six types of butterflies:
Admiral, Black Swallowtail, Machaon, Monarch, Peacock, and Zebra [103]. Each batch
of HITs uses 50 randomly selected images from the collection that are presented to the
workers for classification.
• Customer Care Phone Number Search: In this batch, we ask the workers to find the
customer-care phone number of a given US-based company using the Web.
Our first task is composed of relatively simple HITs that do not require the workers to leave the
HIT page but just to take a decision based on the information displayed. Our second task is
more complex, as it requires classifying butterfly images into predefined classes. We assume
that the workers will not be familiar with this task and will have to learn about the different
classes initially. In that sense, we provide workers with links to Wikipedia pages describing
each of the butterfly species. Our third task is a longer task that requires no special knowledge
but rather to spend some time on the Web to find the requested piece of information.
Table 6.1 gives some statistics for each task, including the number of workers and HITs we considered for each task (50 workers per batch), the base and bonus budgets, and the resulting average execution times and hourly rates. All the tasks were run on the AMT platform.
Our main experimental goals are i) to observe the impact of our different pricing schemes on the total number of tasks completed by the workers in a batch (worker retention) and ii) to compare the resulting batch execution times. Hence, the first goal of our experiments is not to complete each batch but rather to observe how long workers keep working on the batch. Towards that goal, we decided to recruit exactly 50 distinct workers for each batch, and we do not allow the workers to work twice on a given task. We built the backend such that each worker works on his/her HITs in isolation, without any concurrency. This is achieved by allowing 50 repetitions per HIT and recording the worker Id the first time the HIT is accepted; once the count of Ids reaches 50, any newcomer is asked not to accept the HIT. All batches were started at random times during the day and left online long enough to alleviate any effect due to timezones.
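The 50-worker gating logic can be sketched as follows (hypothetical in-memory state; the real backend's layout is not specified at this level of detail):

```python
# Sketch of the backend gating described above: record the first 50 distinct
# worker ids that accept the batch and turn away any newcomer once full.
class BatchGate:
    def __init__(self, max_workers=50):
        self.max_workers = max_workers
        self.workers = set()

    def admit(self, worker_id):
        if worker_id in self.workers:
            return True   # a recorded worker may keep submitting HITs
        if len(self.workers) < self.max_workers:
            self.workers.add(worker_id)
            return True
        return False      # quota full: the newcomer is asked not to accept

gate = BatchGate(max_workers=2)
results = [gate.admit(w) for w in ["a", "b", "a", "c"]]
```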
6.3.2 Experimental Results
Worker Retention. Figure 6.4 shows the effect of the different pricing schemes on worker
retention for the different types of HITs we consider in this work. The first observation we
can make is that the pricing scheme based on the Milestone Bonus that grants rewards when
reaching predefined goals performs best in terms of worker retention: more workers complete
the batch of tasks as compared to other pricing schemes over all the different task types.
Another observation is that in the Butterfly Classification task the training bonus pricing
scheme retains workers better than the increasing or the fixed bonus scheme. This supports
our assumption that overpaying workers at the beginning of the batch, while they are learning about the different butterfly classes, helps them feel rewarded for the learning effort and keeps them working on the batch longer.
On the other hand, the increasing pricing scheme performed worse for both the Item Matching
and the Butterfly Classification batches. This is probably because workers felt underpaid
for the work they were doing and preferred to drop the batch before its end.
Chapter 6. Human Intelligence Task Retention
[Figure: three panels (Butterfly Classification, Customer Care Phone, Item Matching); x-axis: Task Submission Sequence; y-axis: Task Execution Time (in sec); categories: Long, Medium, Short.]
Figure 6.5 – Average HIT execution time with standard error, ordered by sequence in the batch. Results are grouped by worker category (long-, medium- and short-term workers). In many cases, the long-term workers improve their HIT execution time. This is expected to have a positive impact on the overall batch latency.
A final comment concerns the fixed pricing scheme, which shows poor performance in terms
of worker retention over all the task types we have considered. Note that this is the standard
payment scheme used in paid micro-task crowdsourcing platforms like AMT, where each HIT in
a batch is rewarded equally, independently of how many other HITs the worker
has performed in the batch.
Learning Curve. We report on how the execution time varies across the different task types
in Figure 6.5. We group the results into three classes of workers: a) the Short
category, which includes workers having completed 25% or fewer of the tasks in the batch, b) the
Medium category, which includes workers having completed between 25% and 75% of the
HITs in the batch, and c) the Long category, which includes those workers who completed
more than 75% of the tasks.
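These thresholds can be expressed directly; a minimal sketch (the function name is hypothetical):

```python
def worker_category(completed_hits, batch_size):
    """Classify a worker by the fraction of the batch's HITs they completed:
    Short: 25% or fewer, Medium: between 25% and 75%, Long: more than 75%."""
    fraction = completed_hits / batch_size
    if fraction <= 0.25:
        return "Short"
    if fraction <= 0.75:
        return "Medium"
    return "Long"
```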
From the results displayed in Figure 6.5, we observe a significant learning curve for the Butterfly
Classification batch: On average, the first tasks in the batch take workers substantially
longer to complete than the final ones. For the Customer Care Phone Number
Search batch, we see that the task completion time varies from HIT to HIT. We also note that
workers who remained until the end of the batch become slightly faster over time. The
Item Matching batch shows a similar trend, where tasks submitted towards the end of the
batch require on average less time than those submitted initially. Across the different types
of tasks, we also note that workers who are categorized as Short always start slower than
others on average (i.e., workers dropping out early are also slower initially). This is hence an
interesting indicator of potential drop-outs.
6.3. Experimental Evaluation
[Figure: average precision per worker vs. number of tasks submitted (10, 20, 30, 50), grouped by worker category (Short, Medium, Long).]
Figure 6.6 – Overall precision per worker and category of worker for the Butterfly Classification task (using the Increasing Bonus).
These results are particularly important for our goal of improving latency, since retained
workers tend to get faster as they perform more HITs. This gain is expected to have a direct
impact on the overall execution time of the batch. Next, we check whether this has an impact
on the quality of the submitted HITs.
Impact on Work Quality. We report on the quality of the crowdsourced results in Figure 6.6.
We observe that the average precision of the results does not vary across workers who perform
many or few tasks. We observe however that the standard deviation is higher for the workers
dropping early than for those working longer on the batches. In addition, those workers who
perform most of the HITs in the batch never yield low precision results (the bottom right of
the plot is empty). This could be due to a self-selection phenomenon through which workers
who perform quite badly at the beginning of the batch decide to drop out early.
6.3.3 Efficiency Evaluation
In this final experiment, we evaluate the impact of our best approach (Milestone Bonus) on
the end-to-end execution of a batch of HITs, and we compare it with a) the classical approach
with no bonus, and b) using the bonus budget to increase the base reward. In order to get
independent and unbiased results, we created a new task for this experiment4,
which consists in correcting 10 English essays from the ESOL dataset [170]. We ran the three
batches on AMT, each having 10 HITs and requiring 3 repetitions, that is, 3 entries are required
from different workers for each HIT. A summary of our setting is shown in Table 6.2. The three
setups differ as follows:
• Batch A (Milestones): Workers who select Batch A are presented with the interface
displaying the Bonus Bar, configured with milestones at 3, 6 and 10 HITs offering
respectively $0.2, $0.4 and $0.8 bonuses, for a maximum retention budget of $1.4 × 3 = $4.2.
4In the previous set of experiments, we hired more than 450 distinct workers.
• Batch B (Classic): Workers who select Batch B are presented with a classical interface
and receive a fixed reward of $0.2 for each submission they make.
• Batch C (High Reward): Workers who select Batch C are presented with a classical
interface. Here, we use the bonus budget to increase the base reward; workers thus
receive a fixed reward of $0.34 for each submission they make.
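Under the milestone scheme of Batch A, a worker's total pay can be computed as follows; a minimal sketch using the milestone values above (identifier names are hypothetical):

```python
BASE_REWARD = 0.20                         # base reward per HIT in Batch A
MILESTONES = {3: 0.20, 6: 0.40, 10: 0.80}  # bonus granted on reaching n HITs

def total_pay(hits_submitted):
    """Base pay plus every milestone bonus the worker has reached."""
    base = hits_submitted * BASE_REWARD
    bonus = sum(b for n, b in MILESTONES.items() if hits_submitted >= n)
    return round(base + bonus, 2)
```

A worker completing all 10 HITs thus collects the full $1.4 retention bonus on top of the $2 base pay.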
We performed 5 repeated runs as follows: a) we started batches A and B at the same time and
let them run concurrently – this measures the sole effect of retention; b) batch C was launched
separately, since it offers a higher base reward and might influence A and B5.
Batch Type      | #HITs | #Repetitions | Reward | Base Budget | Bonus Budget | Avg. HIT Time | Avg. Hourly Rate
A (Milestones)  | 10    | 3            | $0.2   | $6          | $4.2         | 268 sec       | $5.7/hr
B (Classic)     | 10    | 3            | $0.2   | $6          | N/A          | 310 sec       | $2.4/hr
C (High Reward) | 10    | 3            | $0.34  | $10.2       | N/A          | 302 sec       | $3.9/hr

Table 6.2 – Statistics of the second experimental setting – English Essay Correction.
Figure 6.7 shows the results of 5 repeated experiments with the above settings. We report the
overall execution time after each batch finishes (i.e., when all 3×10 HITs are submitted), the
budget used by each run, the number of workers involved, and how many HITs each worker
submitted. We can observe the effects of retention in batch A, as it involves fewer workers who
submit a greater number of HITs on average, compared to batches B and C. From a latency
perspective, batch A consistently outperforms batch B's execution time, on average by 33%,
thanks to the retention budget in use. While batch C is faster overall – which can be explained
by the fact that it attracts more workers due to its higher reward – it uses the entirety of its budget,
as compared to A, which only uses $2.44 on average.
6.4 Discussion
To summarize, the main findings resulting from our experimental evaluation are:
• Giving workers a punctual bonus for reaching a predefined objective, defined as a given
number of completed tasks, improves worker retention.
• Overpaying workers at the beginning of a batch is useful in case the tasks require an
initial training: Workers feel rewarded for their initial effort and usually continue working
for a lower pay after the learning phase.
• While retention comes at a cost, it also improves latency. Based on our experiments
comparing different setups over multiple runs, we observe that the bonus scheme
involved fewer workers, who performed more tasks on average. This property is particularly
important when the workforce on the crowdsourcing platform is limited.
5To minimize timezone effects, we ran the batch at a similar time of day as A and B.
[Figure: four panels comparing setups A, B and C across five runs: number of tasks submitted per worker, number of workers per run, budget (in USD), and time (in minutes).]
Figure 6.7 – Results of five independent runs of the A, B and C setups. Type A batches include the retention-focused incentive, while Type B is the standard approach using fixed pricing; Batch C uses a higher fixed pricing, leveraging the whole bonus budget.
6.5 Related Work on Worker Retention and Incentives
A number of recent contributions studied the effect of monetary incentives on crowdsourcing
platforms. In [112], Mao et al. compared crowdsourcing results obtained using both volun-
teers and paid workers. Their findings show that the quality of the work performed by both
populations is comparable, while the results are obtained faster when the crowd is financially
rewarded.
Wang et al. [163] looked at pricing schemes for crowdsourcing platforms, focusing on the
quality dimension: The authors proposed methods to estimate the quality of the workers and
introduced new pricing schemes based on the expected contribution of the workers. While also
proposing an adaptive pricing strategy for micro-task crowdsourcing, our work focuses instead
on retaining the crowd longer on a given batch of tasks in order to improve the efficiency of
individual workers and to minimize the overall batch execution time.
Another recent piece of work [101] analyzed how task interruption and context switching
decrease the efficiency of workers performing micro-tasks on a crowdsourcing platform.
This motivates our own work, which aims at providing new incentives to convince the workers
to keep working longer on a given batch of tasks.
Chandler and Horton [38] analyzed, among other factors, the effect of financial bonuses on
crowdsourcing tasks that would otherwise be ignored. Their results show that monetary incentives
worked better than non-monetary ones given that they are directly noticeable by the workers.
In our own work, we display bonus bars on top of the task to inform the worker about his/her
hourly rate, fixed pay, and bonuses for the current HITs.
Recently also, Singer et al. [147] studied the problem of pricing micro-tasks in a crowdsourcing
marketplace under budget and deadline constraints. Our approach aims instead at varying
the price of individual HITs in a batch (i.e., by increasing or decreasing the monetary rewards)
in order to retain workers longer.
Faradani et al. [62] studied the problem of predicting the completion time of a batch of HITs and
its pricing given the current marketplace situation. They proposed a new model for predicting
batch completion times showing that longer batches attract more workers. In comparison, we
experimentally validate our work with real crowd workers completing HITs on a micro-task
crowdsourcing platform (i.e., on AMT).
In [113], Mao et al. looked into crowd worker engagement. Their work is highly related to ours
as it aims to characterize how workers perceive tasks and to predict when they are going to
stop performing HITs. The main difference with our work is that [113] looked at a volunteer
crowdsourcing setting (i.e., they used data from Galaxy Zoo where people classify pictures
of galaxies). This is a key difference as our focus is specifically on finding the right pricing
scheme (i.e., the correct financial reward) to engage workers working on a batch of HITs.
Another setting where retaining workers is critical is push crowdsourcing. Push crowdsourcing
[56] is a special type of micro-task platform where the system assigns HITs to selected workers
instead of letting them do any available HIT on the platform. This is done to improve the
effectiveness of the crowd by selecting the right worker for a specific type of HIT,
based on the worker's profile, which may include previous HIT history, skills and preferences.
Since attracting the desired workers is not guaranteed, keeping them on the target task is
essential.
On a separate note, this piece of work was also inspired by studies on talent management in
corporate settings. Companies have long realized the shortage of highly qualified workers and
the fierce competition to attract top talents. In that context, retaining top-performing employ-
ees longer constitutes an important factor of performance and growth [22, 119]. Although our
present setting is radically different from traditional corporate settings, we identified many
cases where the crowdsourcing requesters (acting as a virtual employer) could use common
human resources practices. In the following, we particularly investigate practices such as:
training cost, bonuses, and attribution of qualifications [75, 16].
6.6 Conclusions
In this chapter, we addressed the problem of speeding up a crowdsourcing campaign by
incentivizing workers such that they keep working longer on a given batch of HITs. Increased
worker retention is valuable in order to avoid the problem of batch starvation (when only a few
remaining HITs are left in a batch and no worker selects them), or if the workforce is limited
on the crowdsourcing platform (a requester tries to keep the workers longer on his batch). We
defined the problem of worker retention and proposed a variety of bonus schemes in order
to maximize retention, including fixed, random, training, increasing, and milestone-based
schemes. We performed an extensive experimental evaluation of our approaches over real
crowds of workers on a popular micro-task crowdsourcing platform. The results of our ex-
perimental evaluation show that the various pricing schemes we have introduced perform
differently depending on the type of task. The best performing pricing scheme in terms of
worker retention is based on milestone bonuses, which are punctually given to the workers
who reach a predefined goal in terms of the number of completed HITs.
We also observe that our best bonus schemes consistently outperform the classic fixed pricing
scheme, both in terms of worker retention and execution efficiency. The main finding is hence
that it is possible to adopt new pricing schemes in order to make workers stay longer on a
given batch of tasks and to obtain results back faster from the crowdsourcing platform.
Worker retention is key in terms of efficiency improvement in the context of hybrid human-
machine systems, and a step towards providing crowdsourcing SLAs for pull-crowdsourcing.
For push-crowdsourcing, we will investigate another technique that relies on task scheduling
in the next chapter.
7 Human Intelligence Task Scheduling
7.1 Introduction
The backend crowdsourced operators of crowd-powered systems typically yield higher laten-
cies than the machine-processable operators, due to inherent efficiency differences between
humans and machines. This problem can be further amplified by a lack of workers on the
target crowdsourcing platform, and/or when the available workers are shared unequally among a
number of competing requesters – including the concurrent users of the same crowd-powered system.
Moreover, in large enterprise settings, it is common that multiple users with different types of
requests submit queries concurrently through the same meta-requester, and end up compet-
ing among themselves. When this happens, it is necessary to correctly manage requests to
avoid latency being impacted any further. Scheduling is the traditional way of tackling such
problems in computer science, by prioritizing access to shared resources to achieve some
quality of service.
In this chapter, we explore and empirically evaluate scheduling techniques that can be used
to manage the internal operations of a crowdsourcing system. More specifically, we focus on
multi-tenant, crowd-powered systems where multiple batches of Human Intelligence Tasks
(HITs) have to be run concurrently. In order to effectively handle the HIT workload generated
by such systems, we implement and empirically compare a series of scheduling techniques
with the aim of improving the overall efficiency of the system. Specifically, we try to answer
the following questions: “Do known scheduling algorithms exhibit their usual properties
when applied to the crowd?” and “What are the adaptations needed to accommodate the
usual crowd work routine?”
Efficiency concerns have so far mostly been tackled by increasing the price of the HITs or by
repeatedly re-posting the HITs on the crowdsourcing platform [62, 26]. Instead, we propose
the use of a HIT-BUNDLE, that is, a group of heterogeneous HITs originating from multiple
clients in a multi-tenant system. This allows us to apply HIT scheduling techniques within the
HIT-BUNDLE and to decide which HIT should be served to the next available worker. While
our focus is on efficiency, the proposed techniques remain compatible with other quality
optimization approaches; merging the two aspects is left outside the scope of this work.
7.1.1 Motivating Use-Cases
Example use case 1: reCAPTCHA [159] is a mechanism that protects websites from bots by
presenting a text transcription challenge that only a human can pass. In return, the
collected transcriptions are used to digitize books. If a similar service were open to external
clients with digitization requests, the system would serve the chopped scans of books
according to a scheduling strategy that meets the clients' requirements, e.g., a throughput
(words/minute) or a deadline target.
Example use case 2: A large organization with multiple departments shares a database system
with crowd-powered user-defined functions (UDFs). In our scenario, the marketing and sales
departments issue a series of queries to their system (see Listing 7.1), generating five different
HIT batches on the crowdsourcing platform. Note that with current systems, distinct queries
would generate isolated concurrent batches with inter-dependent performance.
-- Marketing Department
-- Q1:
SELECT * FROM clients r
WHERE isFemale(r.document_scan)
  AND r.city = 'Philadelphia'
-- Q2:
SELECT hairColor(p.picture), COUNT(*)
FROM person p
GROUP BY hairColor(p.picture)

-- Sales Department (High Priority)
-- Q3:
SELECT * FROM person p
WHERE isFemale(p.picture)
  AND p.marital_status = 'married'
-- Q4:
SELECT *, findCustomerCarePhone(c.name)
FROM clients c
ORDER BY c.sales DESC
-- Q5:
SELECT *, tagESP(b.scan, 2)
FROM business_cards b
Listing 7.1 – Example queries of a crowd-powered DBMS.
We make the following observations:
• Q1 and Q3 use the same query operators.
• Q2 and Q3 use the same input field.
• While queries with the same UDFs can be merged, Q3 should in this case run with a higher
priority.
• Q4 needs to crowdsource the records of customers with the highest sales first.
• Q5 uses a UDF that implements an ESP [156] mechanism for tagging pictures, hence requiring
the live collaboration of two workers.
7.1.2 Objective
We believe that posting HITs individually on a shared crowdsourcing platform, as they get
generated by the crowd-powered DBMS, is suboptimal. Rather, we propose in the following to
manage their execution by regrouping the HITs from the different queries into a single batch
that we call a HIT-BUNDLE. Hence, our goal is to create an intermediate scheduling layer that
has the following objectives:
• improving the overall execution time of the generated workload, while
• ensuring fairness among the different users of the system by equitably balancing the
available workforce, and
• avoiding starvation of smaller requests.
7.1.3 Contributions
We experimentally compare the efficiency of various crowd scheduling approaches with real
crowds of workers working on a micro-task crowdsourcing platform by varying the size of the
crowd, the ordering and priority of the tasks, and the size of the HIT batches. In addition, we
take into account some of the unique characteristics of the crowd workers such as the effect
of context switching and work continuity. Our experimental settings include both controlled
settings with a fixed number of workers involved in the experiments as well as real-world
deployments using HIT workloads taken from a commercial crowdsourcing platform log.
The results of our experimental evaluation indicate that using scheduling approaches for
micro-task crowdsourcing can lead to more efficient multi-tenant crowd-powered DBMSs by
providing faster results and minimizing the overall latency of high-priority work published on
the crowdsourcing platform.
7.2 Scheduling on Amazon MTurk
The AMT Platform: In this work, we aim at comparing approaches that improve the platform's
efficiency as much as possible, given the current workload of HITs from a certain requester.
We chose to design an experimental framework on top of AMT because 1) it is currently the
most popular micro-task crowdsourcing platform, 2) there is a continuous flow of workers
and requesters completing and publishing HITs on the platform, and 3) its activity logs are
available to the public [76].
[Figure: two panels over January–April 2014 (normalized counts): (a) batch distribution per size – most of the batches present on AMT have 10 HITs or fewer; (b) cumulative throughput per batch size – the overall platform throughput is dominated by larger batches. Batch size categories: Tiny [0,10], Small [10,100], Medium [100,1000], Large [1000,Inf].]
Figure 7.1 – An analysis of a three-month activity log of Amazon MTurk (January–March 2014) obtained from mturk-tracker.com [76]. The crawler runs every 20 minutes, hence it might miss some batches. All HITs considered in this plot are rewarded $0.01. Throughput is measured in HITs/minute for HIT batches of different sizes.
Major Requesters and Meta-Requesters: On crowdsourcing platforms, businesses that heavily
rely on micro-task crowdsourcing for their daily operations end up competing with themselves:
If a requester runs concurrent campaigns on a crowdsourcing platform, these will end up
affecting each other. For example, a newly posted large batch of HITs is likely to get more
attention than a two-day-old batch with few HITs remaining that is waiting to be finished (see below
for an explanation of that point).
7.2.1 Execution Patterns on Micro-Task Crowdsourcing Platforms
One of the common phenomena in micro-task crowdsourcing is the presence of long-tail
distributions: In a batch of HITs, the bulk of the work is completed by a few workers,
while the rest is performed by many different workers who each submit
just a few HITs (see, e.g., [64]). We observe this property in our experiments as well.
Figure 7.3b shows the amount of work (number of HITs submitted during the experiment)
performed by each worker during an experiment involving more than 100 crowd
workers working on heterogeneous HITs (see Section 7.2.2). We can see a long-tail distribution
where a few workers perform most of the tasks while many perform just a few tasks.
Another example of long-tail distribution can be observed when considering the throughput:
Large batches are completed at a certain speed by the crowd, up to a certain point when few
HITs are left in the batch. These final few HITs take a much longer time to be completed as
compared to the majority of HITs in the batch. Such a batch starvation phenomenon has been
observed in a number of recent reports, e.g., in [62, 161] where authors observe that the batch
completion time depends on its size and on HIT pricing. HIT completion starts off quickly
but then loses some momentum. A plot depicting this effect on the AMT platform is shown in
Figure 7.1, where we observe that large batches dominate the throughput of a crowdsourcing
platform even if the vast majority of the running batches are very small (less than 10 HITs).
In that sense, large batches of tasks are able to systematically yield higher throughputs as
more crowd workers can work on them in parallel. We can conjecture that these phenomena
[Figure: architecture diagram connecting a Multi-Tenant Crowd-Powered DBMS (CrowdSQL input, batch creation and update, Batch Catalog, status) to the HIT-Bundle Manager – comprising batch merging, the HIT Scheduler and its HIT queue, a Progress Monitor API, a Results Aggregator, and HIT collection and reward – and to the Crowd Layer, where human workers access HITs on the crowdsourcing platform through an external HIT page.]
Figure 7.2 – The role of the HIT Scheduler in a Multi-Tenant Crowd-Powered System architecture (e.g., a DBMS).
are partially due to the preference of the crowd for large batches. Indeed, workers
tend to explore new batches with many HITs, since these have a high reward potential, without
having to search for and select a new HIT context. This is confirmed by our experimental
results (see section 7.4).
Moreover, we can see in Figure 7.3a that the overall throughput of the system increases linearly
with the number of workers involved in a set of batches.
7.2.2 A Crowd-Powered DBMS Scheduling Layer on top of AMT
We now describe the scheduling layer we established on top of AMT to perform our experimen-
tal comparison of different HIT Scheduling techniques. This layer can be used by multi-tenant
crowd-powered DBMSs to efficiently execute user queries.
HIT-BUNDLE: We study scheduling techniques applied to the crowd on AMT by introducing
the notion of a HIT-BUNDLE, that is, a batch container where heterogeneous HITs of comparable
complexity and reward get published continuously by a given AMT requester or, in
our case, by the crowd-powered DBMS. In this section, we describe the main components
of a Multi-Tenant Crowd-Powered DBMS that uses scheduling techniques to optimize the
execution of batches of HITs. Then, we show that having a HIT-BUNDLE not only makes it
possible to apply different scheduling strategies but also produces a higher overall throughput
(see Section 7.4.2).
Framework: Our general framework is depicted in Figure 7.2. The input comes from the
different queries submitted to the system. The query optimizer has the role of deciding what
to ask to the crowd. Subsequently, the HIT Manager generates HIT batches together with
a monetary budget to be spent to obtain the results from the crowd. In traditional crowd-
powered systems, these batches are directly sent to the crowdsourcing platform.
In this work, we consider and experimentally evaluate the performance of an additional
component, the HIT Scheduler, which aims at improving the execution time of selected HITs. Once
[Figure: (a) worker count and throughput (HITs/minute) over time; (b) number of HITs submitted by each worker.]
Figure 7.3 – Results of a crowdsourcing experiment involving 100+ workers concurrently working in a controlled setting on a HIT-BUNDLE containing heterogeneous HITs (B1–B5, see Section 7.4), scheduled with FS. (a) Throughput (measured in HITs/minute) increases with the number of workers involved. (b) Amount of work done by each worker.
new HIT batches are generated, they are put in a container of tasks to-be-crowdsourced. The
scheduler is constantly monitoring the crowd workers and assigning to individual workers the
next HIT to work on based on a scheduling algorithm. More specifically, the HIT Scheduler
collects in its Batch Catalog the set of HIT batches generated by the HIT Manager together
with their reward and priorities.
Next, the HIT-BUNDLE Manager creates a crowdsourcing campaign on AMT. Based on the
scheduling algorithm adopted, a HIT queue (specifying which HIT must be served next in the
HIT-BUNDLE) is generated and periodically updated. As soon as a worker is available, the HIT
Scheduler serves the first element in the queue. When HITs are completed, the results are
collected and sent back to the DBMS for aggregation and query answering. Workers are able
to return HITs they find too boring or poorly paid and, obviously, to leave the system at any
point in time. In these cases, the Scheduler is responsible for updating the queue and
rescheduling uncompleted HITs.
Next, we describe a number of scheduling algorithms that can be used to generate the HIT
queue for crowdsourcing platforms in section 7.3 and experimentally compare their perfor-
mance in section 7.4.
7.3 HIT Scheduling Models
The rest of this chapter focuses on experimentally evaluating scheduling approaches for crowd-
sourcing platforms within the framework presented above in Section 7.2.2. We revisit below
common scheduling approaches used by popular resource managers in shared environments,
and discuss their advantages and drawbacks when applied to a crowdsourcing platform setting
which, as we show in section 7.4, presents several new dimensions to be taken into account
compared to traditional CPU scheduling.
7.3.1 HIT Scheduling: Problem Definition
We now formally define the problem of scheduling HITs generated by a multi-tenant crowd-
based system on top of a crowdsourcing platform.
A query r submitted to the system and including crowd-powered operators generates a batch
B_j of HITs. We define a batch B_j = {h_1, .., h_n} as a set of HITs h_i. Each batch has additional
metadata attached to it: a monetary budget M_j to be spent for its execution and a priority
score p_j with which it should be completed: Batches with higher priority should be executed
before batches with lower priority. Thus, if a high-priority batch is submitted to the platform
while a low-priority batch is still uncompleted, the HITs from the high-priority batch are to be
scheduled to run first.
The problem of scheduling HITs takes as input a set of available batches {B_1, .., B_n} and a crowd
of workers {w_1, .., w_m} currently active on the platform, and produces as output an ordered list
of HITs from {B_1, .., B_n} to be assigned to workers in the crowd by publishing them as a single
HIT-BUNDLE. Once a worker w_i is available, the system assigns him/her the first task in the
list, as decided by the scheduling algorithm.
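The queue-generation step above can be sketched as follows; this is a minimal illustration (function and variable names are hypothetical), in which higher-priority batches are drained first, and same-priority batches are interleaved round-robin so that smaller requests are not starved:

```python
from collections import deque

def build_hit_queue(batches):
    """batches: dict mapping batch_id -> (priority, list_of_hit_ids).
    Return a flat list of HITs in which higher-priority batches come first;
    within the same priority, batches are interleaved round-robin so that
    smaller batches are not starved by larger ones."""
    by_priority = {}
    for _, (priority, hits) in batches.items():
        by_priority.setdefault(priority, []).append(deque(hits))
    queue = []
    for priority in sorted(by_priority, reverse=True):
        pools = by_priority[priority]
        while any(pools):  # round-robin over same-priority batches
            for pool in pools:
                if pool:
                    queue.append(pool.popleft())
    return queue
```

The first element of the resulting list is handed to the next available worker, and the queue is rebuilt whenever a batch is added or a HIT must be re-scheduled.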
Scheduling may need to be repeated over time to update the HIT execution queue. Such
re-scheduling operations are necessary, for example, when a worker fails to complete some of
his/her HITs or when a new batch of HITs is submitted by one of the clients.
In this way, we obtain a hybrid pull-push behavior on top of AMT, as the workers participating
in the crowdsourcing campaign are shown HITs computed by the scheduler. Workers
are still free to decline the HIT, ask for another one, or simply look for another requester on
AMT.
Worker Context Switch
From the worker perspective, scheduling can lead to randomly alternating task types assigned
to a single worker. In such a situation, the worker has to adapt to the new task
instructions, interface, questions, etc., which can be penalizing (see our related work in Section
7.5). This overhead is called a context switch. One of the goals of task scheduling is to improve
the efficiency of each worker by mitigating context switches.
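Given the sequence of task types assigned to a worker, the number of context switches can be measured directly; a minimal sketch (the function name is hypothetical):

```python
def context_switches(task_types):
    """Count how many times consecutive HITs assigned to one worker differ
    in task type; each change forces the worker to re-adapt to new
    instructions and a new interface."""
    return sum(1 for a, b in zip(task_types, task_types[1:]) if a != b)
```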
7.3.2 HIT Scheduling Requirement Analysis
Next, we describe which requirements should be taken into account when applying scheduling
in a crowdsourcing setting. We then use some of these requirements to customize known
scheduling techniques for the crowd.
(R1) Runtime Scalability: Unlike parallel DBMS schedulers, where the compiled query plan
dictates where and when the operators should be executed [150], Crowd-Powered
DBMSs are bound to adopt a runtime scheduler that a) dynamically adapts to the
current availability of the crowd, and b) scales to make real-time scheduling decisions as
the work demand grows. A similar design consideration is adopted by YARN [153],
the new Hadoop resource manager.
(R2) Fairness: An important feature that any shared system should provide is fairness across
the users of the system. By taking control of the HIT-BUNDLE scheduling, the crowd-powered
system acts as a load balancer between the currently available crowd and the remaining HITs
in the HIT-BUNDLE. For example, the scheduler should provide a steady progress to large
requests without blocking – or starving – the smaller requests.
(R3) Priority: In a multi-tenant system, some queries have a higher priority than others.
For this reason, HITs generated from the queries should be scheduled accordingly. In
the case of high-priority requests, one of the standard SLA requirements is the job
deadline that the requester specifies. In a crowdsourcing scheduling setting, as workers
are not committed to the platform and can leave at any point in time, a Crowd-Powered
DBMS scheduler should be best-effort, that is, the system should do its best to meet the
requester priority requirements without any hard guarantee.
(R4) Multiple Resources: Crowd-Powered UDFs can be designed to include specific require-
ments on resources, e.g., qualifications and number of workers. In that sense, we
consider the very common case of collaborative tasks where multiple crowd workers
are needed concurrently. An example of such a task is the ESP game [158], where two
players have to tag images collaboratively. This problem is analogous to the
gang scheduling problem [63] in machine-based systems, where an algorithm can only
run when a given number of CPUs are reserved for that purpose.
(R5) Need for Speed: In hybrid human-machine systems, the crowd-powered modules are
usually the bottleneck in terms of latency. However, real-time crowdsourcing is a
necessity for various interactive applications that require human intelligence at scale
to improve on what machines can do today. Example applications that require real-time
reactions from the crowd include real-time captioning of speech [98], real-time personal
assistants [102], and real-time video filtering [24]. Scheduling HITs belonging to a mixed
workload of real-time and batch jobs is essential to enable real-time crowdsourcing.
(R6) Worker Friendly: Unlike CPUs, people's performance is impacted by many factors,
including training effects, boredom, task difficulty, and interestingness. Scheduling
approaches over the crowd should, whenever possible, take these factors into account.
In this chapter, we experimentally test worker-conscious scheduling methods that aim
at balancing the trade-off between serving similar HITs to workers and providing fair
execution to different HIT batches.
In addition to the machine-specific requirements listed above, we briefly discuss crowd-specific
features that a scheduler needs to take into account when scheduling HITs on crowdsourcing
platforms.
7.3. HIT Scheduling Models
(C1) Laggers: It often happens that HITs are assigned to a worker on a crowdsourcing plat-
form but never get completed [62]. In distributed systems, task execution failures are
usually mitigated by opportunistically duplicating the task on an idle resource. In an
architecture partially powered by micro-tasks, duplicating HITs opportunistically di-
rectly leads to unnecessary monetary costs, especially when the lagging workers end
up doing the task. Instead, the HIT should be released from the lagging worker after a
batch-specific timeout, and only then be reassigned to a new worker.
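As an illustration, the timeout-based release of lagging assignments described above can be sketched as follows (a minimal sketch; the class and method names are hypothetical and not taken from our implementation):

```python
import heapq
from typing import List, Set, Tuple

class AssignmentTracker:
    """Timeout-based release of lagging HIT assignments (sketch of C1)."""

    def __init__(self) -> None:
        self._deadlines: List[Tuple[float, str]] = []  # (deadline, hit_id)
        self._done: Set[str] = set()                   # completed assignments

    def assign(self, hit_id: str, timeout_s: float, now: float) -> None:
        # Each batch can use its own batch-specific timeout value.
        heapq.heappush(self._deadlines, (now + timeout_s, hit_id))

    def complete(self, hit_id: str) -> None:
        self._done.add(hit_id)

    def expired(self, now: float) -> List[str]:
        """HITs whose timeout elapsed without completion; these are released
        from the lagging worker and re-queued for reassignment."""
        released = []
        while self._deadlines and self._deadlines[0][0] <= now:
            _, hit_id = heapq.heappop(self._deadlines)
            if hit_id not in self._done:
                released.append(hit_id)
        return released
```

A completed assignment is simply skipped when its deadline expires, so, unlike opportunistic duplication, no redundant work is ever paid for.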
(C2) Better Resources: Some resources in a shared system might be better than others (some
may be more powerful, consume less energy, etc.) [105]; likewise, in the crowd, some
workers may be more efficient than others or might provide higher-quality results.
Previous work [56] showed how it is possible to predict the quality of the results of a
specific worker on a specific task. Such approaches can be used as additional evidence
for HIT scheduling but are not in the scope of this work. Instead, we focus on scheduling
approaches to improve latency of certain batches in a setting where selected HIT batches
have high priority and need to be executed before others.
7.3.3 Basic Space-Sharing Schedulers
Crowdsourcing platforms usually operate in a non-preemptive mode, that is, they do not allow
the system to interrupt a worker performing a low-priority task in order to have him/her
perform a higher-priority task, at the risk of reneging.1 In our evaluation, we consider common
space-sharing algorithms where a resource (a crowd worker in this case) is assigned a HIT
until he/she finishes it, or returns it uncompleted to the platform.
FIFO
On crowdsourcing platforms, this scheduling scheme has the effect of serving tasks from the
same batch to the workers until the batch is finished. By concentrating the entire workforce
on a single job until it is done, FIFO provides the best per-batch throughput one can expect
from the platform at a given moment in time.
The potential shortcomings of this scheme are as follows: 1) short jobs and high-priority jobs
can get stuck behind long-running tasks, reducing the overall efficiency of the crowdsourcing
system, and 2) when a batch has a large number of tasks, the assigned workers can potentially
get bored [138].
Shortest Job First (SJF)
Other simple scheduling schemes offer different tradeoffs depending on the requirements of
the multi-tenant system. Shortest Job First (SJF) offers a fast turn-around for short HITs, and
can minimize context switches for part of the crowd, since the shortest jobs are either quickly
finished or scheduled to the first available workers.
1 Unless the high-priority task can absorb the reneging cost.
However, SJF is not strategy-proof on current crowdsourcing platforms, as the requesters
can lie about the expected HIT execution times. Hence, these schemes should mostly be used
in trusted settings (e.g., in enterprise crowd-DBMSs). Moreover, these schemes do not
systematically interleave tasks from different batches, and thus present the same
shortcomings as FIFO.
Round Robin (RR)
The previous schemes introduce biases, in the sense that they give an advantage to one batch
over the others. Round Robin removes such biases by assigning HITs from batches in a cyclic
fashion. In this way, all the batches are guaranteed to make regular progress. While Round
Robin ensures an even distribution of the workforce and avoids starvation, it does not meet
our priority requirement (R3) since it is not priority-aware: All the batches are treated equally,
with the side effect that batches with short HITs would (proportionally) get more workforce
than batches with longer HITs. Another risk is that a worker might find herself bouncing across
tasks and being forced to continuously switch context, hence losing time understanding the
specific instructions of each task. The negative effect of context switching is evident from our
experimental results (see Section 7.4) and should be avoided.
7.3.4 Fair Schedulers
In order to deal with batches of HITs having different priorities while avoiding starvation, we
also consider scheduling techniques frequently used in cluster computing.
Fair Sharing (FS)
Sharing heterogeneous resources across jobs having different demands is a well-known and
complex problem that has been tackled by the cluster computing community. One popular
approach, currently used in Hadoop/YARN, is Fair Scheduling (FS) [67]. In the context of
scheduling HITs on a crowdsourcing platform, we borrow this approach in order to achieve fair
scheduling of micro-tasks: Whenever a worker is available, he/she gets a HIT from the batch
with the lowest number of currently assigned HITs, which we call running_tasks. Unlike
Round Robin, this ensures that all the jobs get the same amount of resources (thus being fair).
Algorithm 1 gives the exact way we considered FS in our context.
Weighted Fair Sharing (WFS)
In order to schedule batches with a higher priority first (see R3 in Section 7.3.2), weighted fair
scheduling can be used, in order to assign a task from the job with the least
running_tasks/task_priority value. Line 2 of Algorithm 1 is in that case updated to: Sort B by
increasing ri/pi. This puts more weight on batches with few running tasks and a high priority.

Algorithm 1 Basic Fair Sharing
Input: B = {b1<p1, r1>, .., bn<pn, rn>}, the set of batches currently queued, with priority pi and number of running HITs ri for each batch bi.
Output: HIT hi.
1: When a worker is available for a task
2: BSorted = Sort B by increasing ri
3: hi = BSorted[0].getNextHit()
4: return hi
The following formula gives the fair share of resources (i.e., number of crowd workers) allocated
to a HIT batch j with priority score pj, given concurrent running batches with priority scores
{p1..pN}, at any given point in time:

w_j = p_j / (p_1 + .. + p_N).    (7.1)
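To make the selection rules concrete, the FS and WFS batch-selection steps and the fair share of Equation (7.1) can be sketched as follows (an illustrative sketch; the Batch structure and function names are hypothetical, not taken from our implementation):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Batch:
    """A queued HIT batch (hypothetical structure, for illustration only)."""
    name: str
    priority: float                 # task priority p_i
    running_tasks: int = 0          # r_i: HITs currently assigned from this batch
    pending: List[str] = field(default_factory=list)

def pick_batch(batches: List[Batch], weighted: bool = False) -> Optional[Batch]:
    """FS picks the batch with the lowest r_i; WFS the lowest r_i / p_i."""
    candidates = [b for b in batches if b.pending]
    if not candidates:
        return None
    key = (lambda b: b.running_tasks / b.priority) if weighted \
        else (lambda b: b.running_tasks)
    return min(candidates, key=key)

def fair_share(priorities: List[float]) -> List[float]:
    """Equation (7.1): w_j = p_j / (p_1 + .. + p_N)."""
    total = sum(priorities)
    return [p / total for p in priorities]
```

For instance, three concurrent batches with priority scores 1, 1, and 2 obtain fair shares of 0.25, 0.25, and 0.5 of the available workforce, respectively.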
7.3.5 Gang Scheduling for Collaborative HITs
A Crowd-powered UDF can be coded to require live collaboration of K workers for each HIT it
creates. A typical example is the design of games with a purpose [156] where the participants
can be hired through a paid crowdsourcing platform (see R4 in Section 7.3.2). This is the
equivalent of Gang Scheduling in the context of system scheduling (e.g., MPI) where a job
will not start if the central scheduler cannot provision the required number of resources (i.e.,
CPUs). Different scheduling approaches are necessary for such HITs, as they require two or
more concurrent workers to be completed.
Naive Gang Scheduling (NGS)
The most common way to achieve gang scheduling is to place a reservation on a resource until
the job acquires all the necessary resources. In order to make this technique applicable to a
crowdsourcing platform, one needs to place a reservation on a worker, making the k readily
available workers wait for the remaining K−k workers. This approach is suboptimal in our
context since recruiting all the necessary workers comes with no time guarantee; hence, the
workers might incur an unacceptable idle time, which has a negative impact on their revenue.
The idle effect is, however, mitigated when there is a sufficient number of workers on the
platform.
7.3.6 Crowd-aware Scheduling
In addition to the standard scheduling techniques described above, we also evaluate a couple
of approaches that aim at scheduling tasks while taking the crowd workers' needs into account
(see R6 in Section 7.3.2). In that sense, we propose scheduling approaches that offer a tradeoff
between being fair to the batches (by load-balancing the workers) and being fair to the workers
(by serving HITs with some continuity, if possible, and with minimal wait time).

Algorithm 2 Worker Conscious Fair Share
Input: B = {b1<p1, r1, s1>, .., bn<pn, rn, sn>}, the set of batches currently queued, with priority pi, number of running HITs ri, and concession counter si (initialized to 0) for each batch bi.
Input: K = maximum concession threshold.
Output: HIT hi.
1: When a worker wj is available for a HIT
2: blast = last batch that wj worked on // null if it is a new worker
3: BSorted = Sort B by increasing ri/pi
4: if blast == null then
5:   BSorted[0].s = 0
6:   return BSorted[0].getNextHit()
7: end if
8: for b in BSorted do
9:   if b == blast then
10:    b.s = 0
11:    return b.getNextHit()
12:  else if b.s < K then
13:    b.s++
14:    continue
15:  else
16:    b.s = 0
17:    return b.getNextHit()
18:  end if
19: end for
Worker Conscious Fair Sharing (WCFS)
Worker Conscious Fair Sharing (WCFS) maximizes the likelihood of a worker receiving a
task from a batch he/she worked on recently, thus preventing workers from jumping back and
forth between different tasks (i.e., minimizing context switching). We achieve this by
having top-priority batches concede their position in favor of one of the next batches in the
queue. Each batch can concede its turn up to K times (a predefined concession threshold), which is
reset after the batch is scheduled. This approach is the crowd-equivalent of Delay Scheduling [172].
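The concession mechanism of Algorithm 2 can be sketched as follows (a simplified illustration; class and function names are hypothetical, and a production scheduler would add a fallback when the loop finds no batch to serve):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Batch:
    name: str
    priority: float               # p_i
    running_tasks: int = 0        # r_i
    concessions: int = 0          # s_i, reset whenever the batch is scheduled
    pending: List[str] = field(default_factory=list)

    def get_next_hit(self) -> str:
        return self.pending.pop(0)

def wcfs_assign(batches: List[Batch], last_batch: Optional[str], K: int) -> Optional[str]:
    """Serve the worker's previous batch if it comes within K concessions of the
    fair-share order; otherwise force-schedule a batch that already conceded K times."""
    order = sorted((b for b in batches if b.pending),
                   key=lambda b: b.running_tasks / b.priority)
    if not order:
        return None
    if last_batch is None:        # new worker: plain weighted fair sharing
        order[0].concessions = 0
        return order[0].get_next_hit()
    for b in order:
        if b.name == last_batch or b.concessions >= K:
            b.concessions = 0
            return b.get_next_hit()
        b.concessions += 1        # batch concedes its turn to preserve continuity
    return None
```

In the sketch, a batch ranked first by ri/pi concedes its turn (up to K times) whenever the worker recently served HITs from a lower-ranked batch, trading a little fairness for task continuity.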
Crowd-aware Gang Scheduling (CGS)
Finally, we propose a crowd-aware version of Gang Scheduling for crowdsourcing platforms
by setting a maximum wait time τ that any recruited worker might incur. Whenever the scheduler
decides that a HIT with a gang requirement should be executed, it first checks the expected
finish time for all the workers currently active on the platform based on the average finish time
of the HITs they are currently performing. If K workers can be available in a window of time
τ= t1 − t0, then the batch is scheduled to start at the beginning of the time window. Hence,
the first worker who gets the task will be joined by the other workers in a maximum of τ time.
We deal with the uncertainty of workers quitting the platform after their last HIT by
overprovisioning, that is, by assigning more workers than required to the collaborative HIT,
and by giving the task to any worker who becomes available in the target time window. We
call the resulting technique Crowd-Aware Gang Scheduling (CGS).
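The CGS admission test can be sketched as follows (a simplified model; in a real deployment the expected finish times would come from running per-HIT-type averages, and the parameter names are ours):

```python
from typing import List, Optional

def cgs_can_schedule(expected_finish_times: List[float], now: float,
                     K: int, tau: float, overprovision: int = 1) -> Optional[float]:
    """Decide whether a collaborative HIT needing K concurrent workers can start.

    expected_finish_times: predicted completion time of each active worker's
    current HIT (an idle worker has a finish time <= now).
    Returns the start of the tau-wide window if enough workers are expected to
    be free within it, else None (i.e., do not schedule yet).
    """
    # Workers become available at their expected finish time (at the earliest, now).
    avail = sorted(max(t, now) for t in expected_finish_times)
    need = K + overprovision   # assign extra workers to hedge against drop-outs
    if len(avail) < need:
        return None
    # The need-th earliest availability must fall within tau of the first one,
    # so that no recruited worker waits longer than tau.
    t0 = avail[0]
    if avail[need - 1] - t0 <= tau:
        return t0
    return None
```

For instance, with workers expected to free up at times 0, 5, and 8 seconds, a gang of K = 2 plus one overprovisioned worker fits in a τ = 10 s window, whereas a third worker freeing up at 30 s would make the scheduler (correctly) decline.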
7.4 Experimental Evaluation
We describe in the following our experimental results obtained by scheduling HITs on the AMT crowdsourcing platform. The main questions we want to address are the following:
• Do scheduling approaches keep their properties when used for assigning HITs to workers
in the crowd? (section 7.4.3)
• How do different dimensions like batch priority and crowd size affect execution? (sec-
tion 7.4.3)
• How does scheduling for collaborative HITs perform? (section 7.4.3)
• How do worker-aware approaches behave in terms of throughput and latency? (sec-
tion 7.4.4)
• How do scheduling approaches behave on a real deployment over a commercial crowd-
sourcing platform? (section 7.4.4)
• Do larger HIT batches attract more workers in current crowdsourcing platforms? (sec-
tion 7.4.2)
• Do context switches (i.e., working on a sequence of different HIT types) affect worker
efficiency? (section 7.4.2)
As a general experimental setup, we implemented the architecture proposed in Section 7.2.2
on top of AMT’s API. Our implementation and datasets are available as an open-source project
for reproducibility purposes and as a basis for potential extensions.2
7.4.1 Datasets
For our experiments, we created a dataset of 7 batches of varying complexity, sizes, and
reference prices. The data was partly created by us and partly collected from related works;
it includes typical tasks that could have been generated by crowd-powered DBMSs. Table
7.1 gives a summary of our dataset and provides a short description and references when
applicable. We note that for the purpose of our experiments, we vary the batch sizes and prices
according to the setup.
2 https://github.com/XI-lab/hitscheduler
ID | Dataset | Description | Price per HIT | #HITs | Avg. Time per HIT
B1 | Customer Care Phone Number Search | Find the customer-care phone number of a given US-based company using the Web. | $0.07 | 50 | 75 sec
B2 | Image Tagging | Type all the relevant keywords related to a picture from the ESP game dataset [158]. | $0.02 | 50 | 40 sec
B3 | Sentiment Analysis | Classify the expressed sentiment of a product review (positive, negative, neutral). | $0.05 | 200 | 22 sec
B4 | Type a Short Text | A study on short-term memory, where a worker is presented with a text for a few seconds and is then asked to type it from memory [155]. | $0.03 | 100 | 11 sec
B5 | Spelling Correction | A collection of short paragraphs to spell check, from StackExchange. | $0.03 | 100 | 36 sec
B6 | Butterfly Classification | Classify a butterfly image into one of 6 species (Admiral, Black Swallowtail, Machaon, Monarch, Peacock, and Zebra) [103]. | $0.01 | 600 | 15 sec
B7 | Item Matching | Uniquely identify products that can be referred to by different names (e.g., 'iPad Two' and 'iPad 2nd Generation') [164]. | $0.01 | 96 | 22 sec

Table 7.1 – Description of the batches constituting the dataset used in our experiments.
7.4.2 Micro Benchmarking
The goal of the following micro benchmark experiments is to validate some of the hypotheses
that motivate the use of a HIT-BUNDLE and the design of a worker-aware scheduling algorithm
that minimizes task switching for the crowd workers.
Batch Split-up
The first question we address is whether smaller or larger batches of homogeneous HITs
are more attractive to the workers on AMT. We experimentally check if a single large batch
executes faster than when breaking the same batch into smaller ones. To this end, we use the
batch B6 which we split into 1, 10 and 60 individual batches, containing respectively 600, 60
and 10 HITs each. Next, we run all these batches on AMT concurrently with non-indicative
titles and similar unit prices of $0.01. Note that the batch combinations were published at
the same time on the crowdsourcing platform so all the variables like crowd population and
Figure 7.4 – A performance comparison of batch execution time using different grouping strategies: publishing a large batch of 600 HITs vs. smaller batches of 60 and 10 HITs (from B6).
size, concurrent requesters, and rewards are the same across the different settings. Figure
7.4 shows how the three different batch splitting strategies executed over time on B6. We
observe that running B6 as one large batch of 600 HITs completed first. We also observe that
the strategy with 10 batches only really kicks off when the large batch finishes (and similarly
for the strategy with 60 batches). From this experiment, we conclude that larger batches
provide a better throughput and constitute a better organizational strategy. This finding is
especially interesting for requesters who would periodically run queries that use a common
crowdsourcing operator (albeit with a different input), by pushing new HITs into an existing
HIT-BUNDLE.
Merging Heterogeneous Batches
We extend the above experiment to compare the execution of two heterogeneous batches run
separately or within a single HIT-BUNDLE. Unlike the previous experiment, where the fine-
grained batches were one to two orders of magnitude smaller than the large one, this scenario
involves two batches of type B6 and B7 containing 96 HITs each, versus one HIT-BUNDLE
regrouping all 192 HITs. We run the three batches concurrently on AMT, with non-indicative
titles and similar unit prices of $0.01, and without altering the default serving order within the
HIT-BUNDLE.3 The results are depicted in Figure 7.5. Again, the HIT-BUNDLE exhibits a faster
throughput as compared to the individual batches. Moreover, the embedded batches both finish
before their counterparts running separately.
At this point, we have shown that requesters who run queries invoking different crowd-
sourcing operators can also benefit from pushing their HITs into the same HIT-BUNDLE. Since
a DBMS might support multiple crowdsourcing operators, the next question we explore is
whether context switches (i.e., alternating HIT types) affect worker efficiency.
3 We observe that AMT randomly selects the input to serve.
Figure 7.5 – A performance comparison of batch execution time using different grouping strategies: publishing two distinct batches (192 HITs in total) separately vs. combined inside a HIT-BUNDLE.
Workers' Sensitivity to Context Switch
The following experimental setup involves three groups of 24 distinct workers each. Each
group was exposed to one of three HIT serving strategies, namely:
• RR: a worker in this group receives work in an alternating order from types
{B6, B7, B6, B7, .., B6, B7}.
• SEQ10: here, the workers receive 10 tasks from B6, then 10 tasks from B7, then again
10 from B6, and so on.
• SEQ25: similar to SEQ10 but with sequences of 25 tasks.
In order to trigger context switches, each participant was asked to do at least 10, and up to
100, tasks.
Figure 7.6 shows the average execution time of all the 100 HITs under each execution group.
We observe that the average execution time of HITs is worse when using RR as compared
to workers performing longer alternating sequences in SEQ10 and SEQ25. To test the statistical
significance of these improvements, and since the distribution of HIT execution times cannot
be assumed to be normal, we perform a Wilcoxon signed-rank test. SEQ10
yields p=0.09, which is not enough to achieve statistical significance. However, the SEQ25
improvement over RR is statistically significant with p<0.05.
In conclusion, context switching generates a significant slowdown for the workers, thus reducing
their overall efficiency. Hence, this result motivates the design of a scheduling algorithm that
takes workers' efficiency into account by serving longer sequences of HITs of the same
type.
7.4.3 Scheduling HITs for the Crowd
Now we move our attention to experimentally comparing the scheduling algorithms that are
used to manage the distribution of HITs within a HIT-BUNDLE.
Figure 7.6 – Average execution time for each HIT submitted from the experimental groups RR, SEQ10 and SEQ25 (**: p-value = 0.023).
Controlled Experimental Setup
In order to develop a clear understanding of the properties of classical scheduling algorithms
when applied to crowdsourcing, we put in place an experimental setup that mitigates the
effects of workforce variability over time.4
In our controlled setting, each experiment that we run involves |workforce| = [Minw, Maxw]
crowd workers at any point in time. To stay within this target range, the workers
who arrive first are presented with a reCaptcha to solve (paid $0.01 each) until Minw workers
join the system; at that point, the experiment begins serving tasks. From then on, new
workers are still accepted up to a maximum Maxw. If the number of active sessions drops
below Minw, then the system starts accepting new sessions again. Unless otherwise stated,
we use the following configuration:
• |workforce| = [10, 15].
• Fair Sharing, with price as weighting factor.
• a HIT-BUNDLE of {B1, B2, B3, B4, B5}.
• FIFO order is [B1, B2, B3, B4, B5].
• SJF order is [B4, B3, B5, B2, B1].
Also, we note that each experiment involves a distinct crowd of workers to avoid any further
training effects on the tasks.
4 We decided not to run simulations, but rather to report the actual results obtained with human workers as part of the evaluated system.
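The workforce admission logic of this controlled setting can be sketched as follows (an illustrative sketch; class and method names are ours):

```python
class AdmissionController:
    """Keep the active crowd within [min_w, max_w] (controlled setup sketch).

    Early arrivals solve paid reCaptchas until min_w workers have joined; then
    tasks are served, and new sessions are accepted up to max_w. If the pool
    drops below min_w, admission reopens.
    """

    def __init__(self, min_w: int, max_w: int) -> None:
        self.min_w, self.max_w = min_w, max_w
        self.active = 0
        self.accepting = True

    def on_join(self) -> str:
        """Returns what a newly arrived worker should see."""
        if not self.accepting or self.active >= self.max_w:
            return "reject"
        self.active += 1
        if self.active >= self.max_w:
            self.accepting = False          # pool is full
        # early arrivals are kept busy with paid reCaptchas until min_w join
        return "serve_task" if self.active >= self.min_w else "captcha"

    def on_leave(self) -> None:
        self.active -= 1
        if self.active < self.min_w:
            self.accepting = True           # reopen admission
```

This mirrors the admission rules described above: captchas below Minw, task serving between Minw and Maxw, rejection at Maxw, and re-opened admission once the pool shrinks below Minw.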
Figure 7.7 – Scheduling approaches (FIFO, FS, RR, SJF) applied to the crowd: (a) batch latency; (b) overall experiment latency.
Comparing Scheduling Algorithms
First, we compare how different scheduling algorithms perform from a latency point of view,
taking into account the results of individual batches as well as the overall performance. We
create a HIT-BUNDLE out of {B1,B2,B3,B4,B5}, which is then published to AMT. In each run, we
use a different scheduling algorithm among FIFO, FS, RR, and SJF, with |workforce| = [10, 15].
Figure 7.7 shows the completion time of each batch in our experimental setting and the
cumulative execution time of the whole HIT-BUNDLE.
FS achieved the best overall performance, thus maximizing the system utility, though, at the
batch level, FS did not always win (e.g., for B2). We see how FIFO just assigns tasks from a batch
until it is completed. In our setup, we used the natural order of the batches, which explains
why B1 is getting a preferential treatment as compared to B5, which finishes last. Similarly,
SJF performs unfairly over all the batches but manages to get B4 completed extremely fast. In
fact, SJF uses statistics collected from the system on the execution speed of each operator (see
Table 7.1); this explains the fast execution of B4. On the positive side, we observe that both RR
and FS perform best in terms of fairness with respect to the different batches, i.e., there was
no preferential treatment.
Varying the Control Factors
In order to test our priority control mechanism across the different batches of a HIT-BUNDLE
(tuned using the price), we run an experiment with the same setup as in Section 7.4.3, but
varying the price attached to B2 and using the FS algorithm only. Figure 7.8 shows that
batches with a higher priority (reward) lead to faster completion times using the FS scheduling
approach (the gray bar of batch B2 is lower than the black one). This comes at the expense of
other batches being completed later.
Another dimension that we vary is the crowd size. Figure 7.8b shows the batch completion
times of two different crowdsourcing experiments when we vary the crowd size from
|workforce| = [10, 15] to |workforce| = [20, 25] (keeping all other settings constant). We can see batches
Figure 7.8 – (a) Effect of increasing B2's priority on batch execution time. (b) Effect of varying the number of crowd workers involved in the completion of the HIT batches.
Figure 7.9 – An example of a successful scheduling of a collaborative task involving 3 workers within a window of 10 seconds (assignments of collaborative vs. normal batches, per worker, over time).
being completed faster when more workers are involved. However, different batches obtain
different levels of improvement.
Gang Scheduling Algorithm
We now turn to gang scheduling. Figure 7.9 shows a crowdsourcing experiment where, in
addition to the default 5 HIT types, the HIT-BUNDLE contained one additional collaborative
task, which required exactly three workers at the same time on the same HIT. In detail, the task
asked three workers to collaboratively edit a Google Document to translate a news article.
As we can see on the task assignment plot, the gang scheduling algorithm waits for three
workers to be available in a time window τ = 10 sec before assigning the collaborative task (see
Section 7.3.5).
Figure 7.10 compares how the two gang scheduling algorithms behave in terms of accuracy
and precision. In this setting, accuracy measures the fraction of correct scheduling
and no-scheduling decisions over all the decisions taken by the scheduler.
Precision measures the fraction of correct scheduling decisions over all scheduling decisions.
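With each scheduler decision encoded as a (scheduled, correct) pair, the two measures can be computed as follows (a sketch; the decision encoding is ours, not from our implementation):

```python
from typing import List, Tuple

def gang_metrics(decisions: List[Tuple[bool, bool]]) -> Tuple[float, float]:
    """decisions: (scheduled, correct) pairs, one per scheduler decision.

    accuracy  = correct decisions (schedule or no-schedule) / all decisions
    precision = correct schedule decisions / all schedule decisions
    """
    correct = sum(1 for _, ok in decisions if ok)
    scheduled = [(s, ok) for s, ok in decisions if s]
    accuracy = correct / len(decisions)
    precision = (sum(1 for _, ok in scheduled if ok) / len(scheduled)) \
        if scheduled else 0.0
    return accuracy, precision
```

Note that a correct "do not schedule" decision raises accuracy but leaves precision untouched, which is why CGS can have a higher accuracy than NGS on small windows while both show a low precision.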
Figure 7.10 – Accuracy (a) and precision (b) of the gang scheduling methods (Crowd-GS vs. Naive-GS), for gangs of 2 to 5 workers and window sizes of 5 to 20 seconds.
We observe that for a short time window τ < 4 seconds, CGS schedules HITs with a higher
accuracy than NGS. The reason is that CGS decides not to schedule a HIT if the time window
is too small (and in that sense, it makes the correct decision). However, both approaches fall
short of having a good precision given the small window constraint.
As we increase the time window, we observe that precision also increases (i.e., more scheduling
decisions are correct), while the accuracy of CGS decreases because the approach starts making
wrong scheduling decisions. When the window size becomes larger, precision is high (e.g.,
it is easy to find 3 workers available within 20 seconds) and accuracy grows again (e.g., for 2
workers and a window of more than 15 seconds). Obviously, larger windows are suboptimal as
they require workers to wait longer before starting a HIT. Also, the larger the gang requirement,
the more difficult it gets to come up with a precise schedule.
7.4.4 Live Deployment Evaluation
After the initial evaluation of the different dimensions involved in scheduling HITs over the
crowd, we now evaluate our proposed fair scheduling techniques, FS and WCFS, in an un-
controlled crowdsourcing setting using HIT-BUNDLEs, and compare them against a standard
AMT execution.
More specifically, we create a workload that mimics a 1-hour activity on AMT from a real
requester who had 28 batches running concurrently. Since we do not have access to the input
of the batches, we randomly select batches from all our experimental datasets and adapt the
price and the size to the actual trace. The resulting trace is composed of 28 batches
with similar rewards of $0.01; the largest batch has 45 HITs and the smallest 1 HIT only. For
Figure 7.11 – Average execution time per HIT under different scheduling schemes (individual batches, WCFS, and FS).
analysis purposes, we group batches by size: 16 small batches (1-9 HITs), 8 medium batches
(9-15 HITs), and 4 large batches (16-45 HITs). The total size of this trace is 286 HITs.
Live Deployment Experimental Setup
We publish concurrently the 28 batches from the previously described trace as individual
batches (the standard approach), as well as into two HIT-BUNDLEs, one using FS and the other
using WCFS. The individual batches use meaningful titles and descriptions of their associated
HIT types; the HIT-BUNDLEs, on the other hand, inform the crowd workers that they might
receive HITs from different categories. Other parameters, like requester name and reward, are
the same.
Average Execution Time
Figure 7.11 shows the average HIT execution time obtained by the different setups. Confirming
the results from Section 7.4.2, we observe that workers perform better when working on
individual batches because of the absence of context switches (though the performance
difference is minimal). Instead, when HITs are scheduled, execution time increases, with the
benefit of prioritizing certain batches. We also see that WCFS provides a trade-off between
letting workers work on the same type of HITs for longer and having the ability to schedule
batches fairly, as we shall see next.
Results of the Live Deployment Run
We plot the CDFs of HIT completion per category in Figure 7.12. For example, 25% of small
batches completed in 500 seconds when run individually. For all batch sizes, we observe that
Figure 7.12 – CDFs of HIT completion for different batch sizes (large, medium, small) and scheduling schemes (FS, Individual Batches, WCFS).
individual batches started faster. However, in all cases they also ended last, with smaller
batches especially suffering from some starvation (i.e., long periods without progress); here, we
clearly see the benefits of both FS and WCFS at load balancing.
The final plot (Figure 7.13) shows how a large workload executes over time on the crowd-
sourcing platform. We can see how many workers are involved in each setting and which HIT
batch they are working on (each color represents a different batch). Finally, as expected, the
number of active workers varied wildly over time in each setup. Corroborating the results of
the previous paragraph, the individual batches received more workforce in the beginning (they
started faster); then workers either left, or took some time to spill over to the remaining batches
in the [11:25 - 11:35] time period. Our main observation is that FS and WCFS i) achieve their
desired property of load balancing the batches when there is a sufficient number of workers,
and ii) finish all the jobs well before the individual execution (10-15 minutes earlier considering
the 95th percentile).
7.5 Related Work on Task Scheduling
Collaborative Crowdsourcing
Some crowdsourcing applications (e.g., games with a purpose [156]) may involve multiple
people in completing the task at hand. The most notable example of such applications is the
ESP game [158], where two players are presented with the same image and have to type image
tags as fast as possible; a tag is accepted only if both players enter it. In this case, scheduling
approaches that aim at assigning multiple workers to the same HIT are required. In our work,
we studied how gang scheduling can be adapted to the micro-task crowdsourcing setting.
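The gang requirement can be sketched as a lobby that releases workers only in full groups. This is a toy illustration of the idea, not the adaptation studied in this chapter; the function and its names are invented.

```python
# Minimal sketch of gang-style assignment for multi-player HITs (e.g., the
# ESP game, which needs two workers on the same image at once). Arriving
# workers wait in a lobby until a full gang can be dispatched together.
# Hypothetical simplification, for illustration only.

def form_gangs(arrivals, gang_size):
    """Group arriving worker ids into gangs of `gang_size`; the rest wait."""
    gangs, lobby = [], []
    for worker in arrivals:
        lobby.append(worker)
        if len(lobby) == gang_size:
            gangs.append(tuple(lobby))  # dispatch a complete gang together
            lobby = []
    return gangs, lobby                 # leftover workers keep waiting
```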
Crowdsourced Workflows
Scheduling HITs is also beneficial in the case of crowdsourced workflows [96]. When more
than one HIT batch has to be crowdsourced in order for the system to produce its desired
output, it is important to make sure that batches executed in parallel get the right priority over
[Figure: three panels (FS, Individual Batches, WCFS); x-axis: Time (11:20-11:50); y-axis: #Active Workers.]
Figure 7.13 – Worker allocation with FS, WCFS and classical individual batches in a live deployment of a large workload derived from crowdsourcing platform logs. Each color represents a different batch.
the crowd. While this is very difficult to ensure in standard micro-task crowdsourcing, we can
obtain such prioritization thanks to the techniques proposed in our work.
The Effect of Switching Tasks
When scheduling HITs for the crowd, it is necessary to take the human dimension into account.
Recent work [101] showed how disrupting HIT continuity degrades the efficiency of crowd
workers. Taking this result into account, we designed worker-conscious scheduling approaches
that aim at serving HITs of the same type in sequence to crowd workers in order to leverage
training effects and to avoid the negative effects of context switching.
Studies in the psychology domain have shown that switching between different HIT types has
a negative effect on worker reaction time and on the quality of the work done (see, for example,
[41]). In addition, in this chapter we show how context switching leads to an overall larger
latency in work completion (Section 7.4.2) and propose scheduling techniques that take this
human factor into account. The authors of [171] study the effect of monetary incentives
on task switching, concluding that providing such incentives can help in motivating quality
work in a task-switching situation. In our work, we rather aim at reducing task switching by
consciously scheduling tasks to workers.
7.6 Conclusions
In a shared crowd-powered system environment, multiple users (or tenants) periodically
issue queries that involve a set of crowd-operators (as supported by the system), resulting in
independent crowdsourcing campaigns published on the crowdsourcing platform. In this
chapter, we posit and experimentally show that this divide strategy is not optimal, and that
the crowd-powered system can increase its overall efficiency by bundling requests into a
single one that we call a HIT-BUNDLE. Our micro-benchmarks show that this approach has
two benefits: i) it creates larger batches that have a higher throughput, and ii) it gives the
system control over which HIT to push next, a feature that we leverage, for example, to push
high-priority requests or to serve specific operator needs (e.g., gang scheduling or workflow
management).
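As a rough sketch of the bundling idea, independent batches can be merged into a single prioritized queue from which the system picks the next HIT; the batch structure and field names below are invented for illustration.

```python
# Sketch of the HIT-BUNDLE idea: independent batches from several tenants
# are published as one bundle, and the system decides which batch's HIT to
# serve next (here, simply highest priority first). A real system would
# schedule dynamically; this static sort is for illustration only.

def bundle(batches):
    """batches: {batch_id: {"priority": p, "hits": [...]}} -> one HIT queue."""
    queue = []
    for batch_id, b in batches.items():
        for hit in b["hits"]:
            # Tag each HIT with its source batch so results can be demultiplexed.
            queue.append({"batch": batch_id, "priority": b["priority"], "hit": hit})
    queue.sort(key=lambda h: -h["priority"])  # high-priority HITs served first
    return queue
```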
Fairness is an important feature that any shared environment, including a crowd-powered
system, should support. Thus, we explored the problem of scheduling HITs using weighted Fair
Scheduling algorithms, where priority is expressed as a function of price. However, human
individuals behave very differently from machines: they are sensitive to the context switches that
a regular scheduler might cause. The negative effects of context switching were visible in our
micro-benchmarks and are also supported by related studies in psychology.
We proposed Worker-Conscious Fair Scheduling (WCFS), a new scheduling variant that
strikes a balance between minimizing context switches and preserving the fairness of the system.
We experimentally validated our algorithms over real crowds of workers on a popular paid
micro-task crowdsourcing platform, running both controlled and uncontrolled experiments.
Our results show that it is possible to achieve i) better system efficiency—as we reduce the
overall latency of a set of batches—while ii) providing fair executions across batches, resulting
in iii) small jobs that do not starve.
8 Conclusions
In this thesis, we investigated, designed, and evaluated several methods and algorithms that
improve the efficiency and effectiveness of hybrid human-machine systems. These two dimen-
sions form what we refer to as the Quality of Service of a crowd-powered system. As such, we
explored several aspects related to the execution of batches of HITs on a crowdsourcing plat-
form, including quality assurance, routing, retention, and load balancing. All of our proposed
methods take into account inherent human properties (e.g., unpredictability, preferences, and
poor context switching) in order to achieve their respective goals.
We started by tackling the aggregation of responses to multiple-choice questions in order to
lower the error rate in Chapter 4. We dynamically assigned ad-hoc weights to crowd workers
using probabilistic inference based either on gold-standard test questions or on consensus
among previously screened workers. We also proposed a novel crowdsourcing mechanism
called push in Chapter 5, which matches tasks to crowd participants based on their general
interests inferred from their social profiles.
Next, we turned our attention to efficiency. In Chapter 6 we explored worker retention as a
means to reduce the execution time of a batch of tasks and avoid its starvation. We achieved
retention using punctual bonuses as an alternative to increasing the overall batch reward. Load
balancing is another technique that we investigated, in Chapter 7, with the aim of improving the
overall efficiency of a shared crowd-powered system that runs several heterogeneous batches
of HITs. While this method has been previously applied to CPUs and clusters, applying it on
top of a crowdsourcing platform requires careful scheduling decisions that maximize task
continuity for each worker.
Finally, the methods explored in this thesis were designed with an eye on scalability; this aspect
will prove especially valuable if both demand and supply in the crowdsourcing market
grow in the future. In fact, crowdsourcing platforms might have millions of workers requesting
new tasks to be completed. A smart scheduling system not only has to decide which worker
gets which task, but also has to cope with the increasing load (i.e., thousands of
scheduling decisions per second). For that purpose, our contributions are modular, scalable,
and can be integrated separately or combined in a CrowdManager – the logical interface that
bridges a computer program with a paid micro-task crowdsourcing platform.
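One hypothetical way to picture such a CrowdManager is as an interface whose methods loosely mirror the components studied in this thesis; none of this is an actual API, and the toy implementation (FIFO routing, majority-vote aggregation) is a stand-in for the real mechanisms.

```python
# Sketch of the CrowdManager as a logical interface between a program and a
# paid micro-task platform. Method names mirror the thesis components
# (publishing/bundling, routing, aggregation) but are purely illustrative.

from abc import ABC, abstractmethod

class CrowdManager(ABC):
    """Logical bridge between a computer program and a crowdsourcing platform."""
    @abstractmethod
    def publish(self, batch_id, hits, priority=1): ...   # add a batch to the bundle
    @abstractmethod
    def route(self, worker): ...                         # pick the next HIT for a worker
    @abstractmethod
    def aggregate(self, hit_id, answers): ...            # combine crowd answers

class InMemoryCrowdManager(CrowdManager):
    """Toy implementation: priority queue routing, majority-vote aggregation."""
    def __init__(self):
        self.queue = []
    def publish(self, batch_id, hits, priority=1):
        self.queue.extend((priority, batch_id, h) for h in hits)
        self.queue.sort(key=lambda t: -t[0])   # high-priority HITs first
    def route(self, worker):
        return self.queue.pop(0)[2] if self.queue else None
    def aggregate(self, hit_id, answers):
        return max(set(answers), key=answers.count)  # simple majority vote
```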
8.1 Future Work
There are many research directions that are worth investigating in order to improve the QoS
in crowd-powered systems. In the following, we present some important ideas that could be
pursued as an extension of this work, together with ideas that would require new platforms
and crowd organizations.
8.1.1 Toward Crowdsourcing Platforms with an Integrated CrowdManager
The CrowdManager components studied throughout this thesis were designed individually;
combined, they can form the basis of a novel crowdsourcing platform that offers new capa-
bilities to both requesters and workers. We envision that such a platform will operate
in a push-crowdsourcing mode where tasks will be scheduled to meet the workloads
published by the requesters. Our scheduling algorithm will take into account the skills of the
prospective workers. Our answer aggregation mechanism will use more precise priors, that is,
the skills of the workers. Task pricing will also be dynamic, taking into account both the
workload on the platform and the skills of the workers. Crowd workers will automatically
receive tasks tailored to their interests and general knowledge without wasting time
browsing a long list of tasks on a dashboard, as is the case today. Because these changes
require full knowledge of the workload, workforce, worker profiles, etc., we believe that only a
full-fledged platform has the power to provide such a deep integration.
8.1.2 Worker Flow
As we saw in Chapter 6, one of the benefits of worker retention is that it can lead to faster
completion and non-starving crowdsourcing campaigns. In our system, we retained people
by using bonuses. Although this is a common human-resources practice, other retention
schemes could be investigated.
Flow, in psychology, is a concept that designates the state of mind in which an individual
is completely immersed in an activity [42]. As Figure 8.1 illustrates, being in the flow state
reflects a balance between the skills the person brings to an activity and the difficulty of
that activity. As such, if the person is overskilled, i.e., has high skills compared to the
given task, he might quickly get bored. Likewise, if the person does not have the skills to
conduct a complex task, he might quickly get anxious. We can hypothesize that maintaining
Flow is desirable for a micro-task worker, both by improving his/her experience and by
maintaining high answer quality and low response time. Given the repetitive and potentially
dull nature of micro-tasks, a batch of HITs can be dynamically altered so that it continuously
[Figure: Flow channel diagram; axes: Skills vs. Difficulty; regions labeled Flow, Boredom, Anxiety.]
Figure 8.1 – The concept of the Flow Theory [42].
challenges or relaxes the worker to keep him/her in a Flow state. The system should automatically
sense these states and act accordingly in order to help the worker reach and maintain that state. One
possible direction is to create a strategy based on the expected response time of each task
type. For example, if the worker exhibits a response time that is consistently lower than
the mean, then the worker might be too skilled for the task at hand and can eventually get
bored. On the contrary, if the response time is higher than the mean, then the worker is most
probably struggling with the task. The system would then dynamically respond to these signals
by proposing easier or, respectively, more challenging tasks.
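This response-time heuristic could be sketched as follows; the 20% margin, the function name, and the decision labels are invented for illustration.

```python
# Hypothetical sketch of the response-time signal described above: compare a
# worker's recent response times to the task type's mean and nudge the task
# difficulty to keep the worker in the Flow channel. The margin threshold is
# invented for illustration.

from statistics import mean

def adjust_difficulty(recent_times, type_mean, margin=0.2):
    """Return 'harder', 'easier', or 'keep' for the worker's next task."""
    avg = mean(recent_times)
    if avg < (1 - margin) * type_mean:
        return "harder"   # consistently fast -> likely bored, challenge them
    if avg > (1 + margin) * type_mean:
        return "easier"   # consistently slow -> likely struggling, relax
    return "keep"         # within the margin -> assume the worker is in Flow
```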
8.1.3 HIT Recommender System
In Chapter 5 we mostly described and evaluated task routing from a system perspective and
as a means to improve the quality of the submitted responses. Still, task routing is essentially a
recommendation system, one that is beneficial to the crowd workers as well. Effective HIT
recommendation would reduce the time needed to find an interesting or suitable batch to work on
and improve the worker's productivity.
Our task matching technique relies on the workers' social profiles; if such information is
not available, one can apply machine learning techniques and infer workers' skills and
knowledge automatically based on historical data, e.g., previously chosen tasks, performance
per task type, etc. For example, if the task requires a movie-savvy crowd, we can use a system
similar to a movie recommender in order to match the tasks to prospective workers.
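A toy sketch of such history-based matching, assuming per-type accuracy scores in [0, 1] and a normalized-popularity fallback for unseen task types; the names and the heuristic itself are invented, not the thesis's technique.

```python
# Illustrative history-based HIT recommendation: score each open batch by
# the worker's past accuracy on that task type, falling back to batch
# popularity (assumed normalized to [0, 1]) for unseen types. An invented
# cold-start heuristic, for illustration only.

def recommend(worker_history, open_batches):
    """worker_history: {task_type: accuracy}; open_batches: [(id, type, popularity)]."""
    def score(batch):
        _, task_type, popularity = batch
        return worker_history.get(task_type, popularity)
    return max(open_batches, key=score)[0]   # id of the best-matching batch
```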
An initial step in this direction is OpenTurk [7], a Chrome extension that we built which allows
AMT workers to manage their favorite requesters, share HITs they like with other workers,
and work on HITs that other workers have liked. OpenTurk has a recommendation tab that
recommends tasks to workers; currently, this feature recommends tasks based on their
popularity.
8.1.4 Crowd-Powered Big Data Systems
The engineering efforts around crowdsourcing for data management have been geared toward
DBMSs. While this is a valid pursuit, the relatively timid commercial adoption of this model
can be traced back to the limited performance that the crowd can provide in comparison to native
operations. An alternative engineering effort would be to build crowdsourcing modules for
batch-oriented data management systems, where faulty and late execution of some units is
tolerable by design. One can leverage the ManReduce programming model proposed in [10] to
extend the MapReduce implementation of Hadoop. In this model, HITs will be initiated and
scheduled (see Chapter 7) like any other execution unit, with the difference that HITs will be
sent to a crowdsourcing platform to be examined by crowd workers. Once each HIT is submitted,
the results are collected and integrated with the rest of the execution pipeline.
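A minimal sketch of such a crowd-powered step inside a batch pipeline, in the spirit of ManReduce [10]; `post_hit` is a hypothetical stand-in for a real platform call, and the fault-tolerance policy (dropping failed units) is invented for illustration.

```python
# Sketch of a "human map" step in a batch-oriented pipeline: each record is
# posted as a HIT and its answer collected later, with failed or late units
# tolerated by design. `post_hit` stands in for a real crowdsourcing-platform
# call and is purely illustrative.

def human_map(records, post_hit):
    """Send each record to the crowd; skip units whose HIT yields no answer."""
    results = []
    for record in records:
        answer = post_hit(record)       # would block on / poll a real platform
        if answer is not None:          # failed units can be retried or dropped
            results.append((record, answer))
    return results
```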
8.1.5 Social and Mobile Crowdsourcing
Some tasks can only be completed by a very limited group of people, e.g., translating a dialect
into English, finding a missing person, or recognizing the geographical location of a place shown in
a photograph. Assuming that this target group of people could be incentivized to perform the
task, the question is how to find them quickly. A possible solution is to build crowdsourcing
platforms with social connections [173]. Here, the workers are no longer isolated – already
many communicate and share thoughts on specialized forums – and can choose to be solicited
on the go. We can then introduce the notion of a referral-task, where a worker gets paid for
referring the right person or for contributing to a successful referral chain.
8.2 Outlook
Crowdsourcing offers a new and unique form of income to web users. This has remarkable
social implications, like breaking geographical barriers and opening new opportunities for
the less favored parts of the world and for unskilled people. For companies, having an
elastic workforce through crowdsourcing facilitates agile processes, and helps solve complex
tasks at scale and on demand without long-term commitments. Crowdsourcing, however,
raises several concerns regarding global employment, fair wages, and social security for crowd
workers. Likewise, companies and requesters have less flexibility in terms of the data they can
expose, and are often worried about low-quality results obtained from the crowd.
While the future of crowdsourcing is yet to be shaped – from both a technology and a legal-
framework perspective – it is clear that there is a strong market potential. We can foresee that
the platform of the future will: (1) Offer Service Level Agreements to requesters such that they
can use crowdsourcing in mission-critical applications. (2) Propose suitable tasks to workers
in order to maximize their revenue and productivity. The platform could also provide training
programs for workers to learn new skills, earn a degree or a certificate. Workers can even
develop platform-specific skills, e.g., managing complex client jobs, decomposing larger tasks into
smaller ones, and facilitating collaborative tasks.
Bibliography
[1] Amazon mechanical turk. http://www.mturk.com. Last accessed: 2014-12-30.
[2] Clickworker. http://www.clickworker.com. Last accessed: 2014-12-30.
[3] CloudFactory: making data valuable in a hyper-efficient way. http://www.cloudfactory.com.
Last accessed: 2014-12-30.
[4] CrowdFlower people-powered data enrichment platform. http://www.crowdflower.com. Last
accessed: 2014-12-30.
[5] Facebook. http://www.facebook.com/. Last accessed: 2014-12-30.
[6] Mobileworks. http://www.mobileworks.com. Not accessible as of: 2015-01-08.
[7] Openturk. http://www.openturk.com/. Last accessed: 2014-12-30.
[8] psiTurk: crowdsource your research. https://psiturk.org/. Last accessed: 2014-12-30.
[9] M. Agrawal, M. Karimzadehgan, and C. Zhai. An online news recommender system for social
networks. In Proceedings of ACM SIGIR workshop on Search in Social Media, 2009.
[10] S. Ahmad, A. Battle, Z. Malkani, and S. Kamvar. The jabberwocky programming environment
for structured social computing. In Proceedings of the 24th annual ACM symposium on User
interface software and technology, pages 53–64. ACM, 2011.
[11] S. Allan and E. Thorsen. Citizen journalism: Global perspectives, volume 1. Peter Lang, 2009.
[12] O. Alonso and R. A. Baeza-Yates. Design and Implementation of Relevance Assessments Using
Crowdsourcing. In ECIR, pages 153–164, 2011.
[13] Y. Amsterdamer, Y. Grossman, T. Milo, and P. Senellart. Crowd mining. In Proceedings of the 2013
international conference on Management of data, pages 241–252. ACM, 2013.
[14] A. Anagnostopoulos, L. Becchetti, C. Castillo, A. Gionis, and S. Leonardi. Power in unity: Forming
teams in large-scale community systems. In Proceedings of the 19th ACM International Conference
on Information and Knowledge Management, CIKM ’10, pages 599–608, New York, NY, USA, 2010.
ACM.
[15] A. Anagnostopoulos, L. Becchetti, C. Castillo, A. Gionis, and S. Leonardi. Online team formation
in social networks. In Proceedings of the 21st International Conference on World Wide Web, WWW
’12, pages 839–848, New York, NY, USA, 2012. ACM.
[16] D. Arthur. The employee recruitment and retention handbook. AMACOM Div American Mgmt
Assn, 2001.
[17] P. Bailey, A. P. de Vries, N. Craswell, and I. Soboroff. Overview of the TREC 2007 Enterprise Track.
In TREC, 2007.
[18] K. Balog, Y. Fang, M. de Rijke, P. Serdyukov, and L. Si. Expertise retrieval. Foundations and Trends
in Information Retrieval, 6(2-3):127–256, 2012.
[19] K. Balog, P. Serdyukov, and A. P. de Vries. Overview of the TREC 2010 Entity Track. In TREC, 2010.
[20] K. Balog, P. Thomas, N. Craswell, I. Soboroff, P. Bailey, and A. De Vries. Overview of the TREC 2008
Enterprise Track. Technical report, DTIC Document, 2008.
[21] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open Information
Extraction from the Web. In IJCAI, pages 2670–2676, 2007.
[22] C. Bartlett and S. Ghoshal. Building competitive advantage through people. Sloan Mgmt. Rev,
43(2), 2013.
[23] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering.
In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and
data mining, pages 59–68. ACM, 2004.
[24] M. S. Bernstein, J. Brandt, R. C. Miller, and D. R. Karger. Crowds in two seconds: enabling realtime
crowd-powered interfaces. In UIST ’11, pages 33–42. ACM, 2011.
[25] M. S. Bernstein, J. Teevan, S. Dumais, D. Liebling, and E. Horvitz. Direct answers for search
queries in the long tail. In CHI ’12, pages 237–246. ACM, 2012.
[26] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White,
S. White, et al. Vizwiz: nearly real-time answers to visual questions. In UIST, pages 333–342.
ACM, 2010.
[27] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity
measures. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge
discovery and data mining, KDD ’03, pages 39–48, New York, NY, USA, 2003. ACM.
[28] R. Blanco, H. Halpin, D. Herzig, P. Mika, J. Pound, H. S. Thompson, and D. T. Tran. Repeatable
and reliable search system evaluation using crowdsourcing. In SIGIR, pages 923–932, 2011.
[29] R. Blanco, P. Mika, and S. Vigna. Effective and Efficient Entity Search in RDF Data. In International
Semantic Web Conference (ISWC), pages 83–97, 2011.
[30] P. Bouquet, H. Stoermer, C. Niederee, and A. Mana. Entity Name System: The Backbone of an
Open and Scalable Web of Data. In Proceedings of the IEEE International Conference on Semantic
Computing, ICSC 2008, pages 554–561.
[31] A. Bozzon, M. Brambilla, and S. Ceri. Answering search queries with CrowdSearcher. In WWW,
pages 1009–1018, New York, NY, USA, 2012. ACM.
[32] A. Bozzon, M. Brambilla, S. Ceri, and A. Mauri. Extending search to crowds: A model-driven
approach. In SeCO Book, pages 207–222. 2012.
[33] A. Bozzon, M. Brambilla, and A. Mauri. A model-driven approach for crowdsourcing search. In
CrowdSearch, pages 31–35, 2012.
[34] A. Bozzon, I. Catallo, E. Ciceri, P. Fraternali, D. Martinenghi, and M. Tagliasacchi. A framework
for crowdsourced multimedia processing and querying. In CrowdSearch, pages 42–47, 2012.
[35] L. Breiman and A. Cutler. Random Forests. https://www.stat.berkeley.edu/~breiman/
RandomForests/cc_home.htm. Last accessed: 2015-03-04.
[36] R. C. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation.
In EACL, 2006.
[37] M. Catasta, A. Tonon, D. E. Difallah, G. Demartini, K. Aberer, and P. Cudré-Mauroux. TransactiveDB:
Tapping into collective human memories. Proceedings of the VLDB Endowment, 7(14), 2014.
[38] D. Chandler and J. J. Horton. Labor Allocation in Paid Crowdsourcing: Experimental Evidence
on Positioning, Nudges and Prices. In Human Computation, 2011.
[39] P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE
Trans. on Knowl. and Data Eng., 24(9):1537–1555, Sept. 2012.
[40] M. Ciaramita and Y. Altun. Broad-coverage sense disambiguation and information extraction
with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods
in Natural Language Processing, EMNLP ’06, pages 594–602, Stroudsburg, PA, USA, 2006. ACL.
[41] M. J. Crump, J. V. McDonnell, and T. M. Gureckis. Evaluating amazon’s mechanical turk as a tool
for experimental behavioral research. PloS one, 8(3):e57410, 2013.
[42] M. Csikszentmihalyi. Flow: The psychology of optimal experience, volume 41. HarperPerennial,
New York, 1991.
[43] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings
of EMNLP-CoNLL, volume 2007, pages 708–716, 2007.
[44] P. Cudré-Mauroux, K. Aberer, and A. Feher. Probabilistic Message Passing in Peer Data Manage-
ment Systems. In International Conference on Data Engineering (ICDE), 2006.
[45] P. Cudré-Mauroux, P. Haghani, M. Jost, K. Aberer, and H. De Meer. idMesh: graph-based disam-
biguation of linked data. In WWW ’09, pages 591–600, New York, NY, USA, 2009. ACM.
[46] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A framework and graphical
development environment for robust NLP tools and applications. In Proceedings of the 40th
Anniversary Meeting of the ACL, 2002.
[47] S. B. Davidson, S. Khanna, T. Milo, and S. Roy. Using the Crowd for Top-k and Group-by Queries.
In Proceedings of the 16th International Conference on Database Theory, ICDT ’13, pages 225–236,
New York, NY, USA, 2013. ACM.
[48] J. Davis, J. Arderiu, H. Lin, Z. Nevins, S. Schuon, O. Gallo, and M.-H. Yang. The HPU. In Computer
Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on,
pages 9–16. IEEE, 2010.
[49] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the
em algorithm. Applied statistics, pages 20–28, 1979.
[50] G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. ZenCrowd: leveraging probabilistic reason-
ing and crowdsourcing techniques for large-scale entity linking. In WWW, pages 469–478, New
York, NY, USA, 2012.
[51] G. Demartini, B. Trushkowsky, T. Kraska, M. J. Franklin, and U. Berkeley. Crowdq: Crowdsourced
query understanding. In CIDR, 2013.
[52] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society, 39, 1977.
[53] D. Deng, C. Shahabi, and U. Demiryurek. Maximizing the number of worker’s self-selected tasks
in spatial crowdsourcing. In Proceedings of the 21st ACM SIGSPATIAL International Conference
on Advances in Geographic Information Systems, SIGSPATIAL’13, pages 324–333, New York, NY,
USA, 2013. ACM.
[54] E. Diaz-Aviles and R. Kawase. Exploiting twitter as a social channel for human computation. In
CrowdSearch, pages 15–19, 2012.
[55] D. E. Difallah, G. Demartini, and P. Cudré-Mauroux. Mechanical cheat: Spamming schemes and
adversarial techniques on crowdsourcing platforms. In CrowdSearch, pages 26–30, 2012.
[56] D. E. Difallah, G. Demartini, and P. Cudré-Mauroux. Pick-a-crowd: tell me what you like, and
i’ll tell you what to do. In Proceedings of the 22nd international conference on World Wide Web,
pages 367–374. International World Wide Web Conferences Steering Committee, 2013.
[57] X. Dong, A. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces.
In SIGMOD, pages 85–96. ACM, 2005.
[58] P. Donmez, J. G. Carbonell, and J. Schneider. Efficiently learning the accuracy of labeling sources
for selective sampling. In Proceedings of the 15th ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 259–268. ACM, 2009.
[59] P. Donmez, J. G. Carbonell, and J. G. Schneider. A probabilistic framework to learn from multiple
annotators with time-varying accuracy. In SDM, volume 2, page 1. SIAM, 2010.
[60] J. S. Downs, M. B. Holbrook, S. Sheng, and L. F. Cranor. Are your participants gaming the system?:
screening mechanical turk workers. In Proceedings of the SIGCHI Conference on Human Factors
in Computing Systems, pages 2399–2402. ACM, 2010.
[61] C. Eickhoff and A. P. de Vries. Increasing cheat robustness of crowdsourcing tasks. Information
retrieval, 16(2):121–137, 2013.
[62] S. Faradani, B. Hartmann, and P. G. Ipeirotis. What’s the right price? pricing tasks for finishing on
time. In Human Computation, 2011.
[63] D. G. Feitelson and L. Rudolph. Gang scheduling performance benefits for fine-grain synchro-
nization. Journal of Parallel and Distributed Computing, 16(4):306 – 318, 1992.
[64] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. CrowdDB: answering queries
with crowdsourcing. In Proceedings of the 2011 ACM SIGMOD International Conference on
Management of data, SIGMOD ’11, pages 61–72, New York, NY, USA, 2011. ACM.
[65] U. Gadiraju, R. Kawase, and S. Dietze. A taxonomy of microtasks on the web. In Proceedings of
the 25th ACM Conference on Hypertext and Social Media, HT ’14, pages 218–223, New York, NY,
USA, 2014. ACM.
[66] L. Getoor and A. Machanavajjhala. Entity Resolution: Tutorial. In VLDB, 2012.
[67] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource
fairness: fair allocation of multiple resource types. In NSDI’11, pages 24–24. USENIX Association,
2011.
[68] J. A. Golbeck. Computing and applying trust in web-based social networks. PhD thesis, College
Park, MD, USA, 2005. AAI3178583.
[69] S. Guo, A. Parameswaran, and H. Garcia-Molina. So who won?: dynamic max discovery with the
crowd. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of
Data, pages 385–396. ACM, 2012.
[70] K. Haas, P. Mika, P. Tarjan, and R. Blanco. Enhanced results for web search. In SIGIR, pages
725–734, 2011.
[71] X. Han, L. Sun, and J. Zhao. Collective entity linking in web text: a graph-based method. In SIGIR,
pages 765–774, New York, NY, USA, 2011. ACM.
[72] X. Han and J. Zhao. Named entity disambiguation by leveraging wikipedia semantic knowledge.
In Proceeding of the 18th ACM conference on Information and knowledge management, CIKM ’09,
pages 215–224, New York, NY, USA, 2009. ACM.
[73] M. Hirth, T. Hoßfeld, and P. Tran-Gia. Cost-optimal validation mechanisms and cheat-detection
for crowdsourcing platforms. In Innovative Mobile and Internet Services in Ubiquitous Computing
(IMIS), 2011 Fifth International Conference on, pages 316–321. IEEE, 2011.
[74] J. Howe. The rise of crowdsourcing. Wired magazine, 14(6):1–4, 2006.
[75] M. A. Huselid. The impact of human resource management practices on turnover, productivity,
and corporate financial performance. Academy of management journal, 38(3):635–672, 1995.
[76] P. G. Ipeirotis. Analyzing the amazon mechanical turk marketplace. XRDS: Crossroads, The ACM
Magazine for Students, 17(2):16–21, 2010.
[77] P. G. Ipeirotis and E. Gabrilovich. Quizz: targeted crowdsourcing with a billion (potential) users. In
Proceedings of the 23rd international conference on World wide web, pages 143–154. International
World Wide Web Conferences Steering Committee, 2014.
[78] P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang. Repeated labeling using multiple noisy labelers.
Data Mining and Knowledge Discovery, 28(2):402–441, 2014.
[79] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on amazon mechanical turk. In
Proceedings of the ACM SIGKDD workshop on human computation, pages 64–67. ACM, 2010.
[80] L. C. Irani and M. S. Silberman. Turkopticon: Interrupting Worker Invisibility in Amazon Me-
chanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems,
CHI ’13, pages 611–620, New York, NY, USA, 2013. ACM.
[81] M. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of
Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[82] S. R. Jeffery, L. Sun, M. DeLand, N. Pendar, R. Barber, and A. Galdi. Arnold: Declarative crowd-
machine data integration. In CIDR, 2013.
[83] R. Jurca and B. Faltings. Mechanisms for making crowds truthful. J. Artif. Intell. Res. (JAIR),
34:209–253, 2009.
[84] D. R. Karger, S. Oh, and D. Shah. Budget-optimal task allocation for reliable crowdsourcing
systems. Operations Research, 62(1):1–24, 2014.
[85] G. Kazai. In Search of Quality in Crowdsourcing for Search Engine Evaluation. In ECIR, pages
165–176, 2011.
[86] G. Kazai, J. Kamps, M. Koolen, and N. Milic-Frayling. Crowdsourcing for book search evaluation:
impact of hit design on comparative system ranking. In SIGIR, pages 205–214, 2011.
[87] R. Khazankin, H. Psaier, D. Schall, and S. Dustdar. QoS-Based Task Scheduling in Crowdsourcing
Environments. In Proceedings of the 9th International Conference on Service-Oriented Computing,
ICSOC’11, pages 297–311, Berlin, Heidelberg, 2011. Springer-Verlag.
[88] A. Kittur, E. H. Chi, and B. Suh. Crowdsourcing user studies with mechanical turk. In Proceedings
of the SIGCHI conference on human factors in computing systems, pages 453–456. ACM, 2008.
[89] A. Kittur, S. Khamkar, P. André, and R. Kraut. Crowdweaver: visually managing complex crowd
work. In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work,
pages 1033–1036. ACM, 2012.
[90] A. Kittur, J. V. Nickerson, M. Bernstein, E. Gerber, A. Shaw, J. Zimmerman, M. Lease, and J. Hor-
ton. The future of crowd work. In Proceedings of the 2013 Conference on Computer Supported
Cooperative Work, CSCW ’13, pages 1301–1318, New York, NY, USA, 2013.
[91] A. Kittur, B. Smus, S. Khamkar, and R. E. Kraut. Crowdforge: Crowdsourcing complex work. In
Proceedings of the 24th annual ACM symposium on User interface software and technology, pages
43–52. ACM, 2011.
[92] D. Klein and C. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual
Meeting on Association for Computational Linguistics-Volume 1, pages 423–430. Association for
Computational Linguistics, 2003.
[93] C. Kohlschütter, P. Fankhauser, and W. Nejdl. Boilerplate detection using shallow text features. In
WSDM, pages 441–450, 2010.
[94] S. Konomi, W. Ohno, T. Sasao, and K. Shoji. A context-aware approach to microtasking in a public
transport environment. In Communications and Electronics (ICCE), 2014 IEEE Fifth International
Conference on, pages 498–503. IEEE, 2014.
[95] F. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE
Transactions on Information Theory, 47(2), 2001.
[96] A. Kulkarni, M. Can, and B. Hartmann. Collaboratively crowdsourcing workflows with turkomatic.
In CSCW ’12, pages 1003–1012. ACM, 2012.
[97] A. Kulkarni, P. Gutheim, P. Narula, D. Rolnitzky, T. Parikh, and B. Hartmann. Mobileworks:
designing for quality in a managed crowdsourcing architecture. Internet Computing, IEEE,
16(5):28–35, 2012.
[98] R. S. Kushalnagar, W. S. Lasecki, and J. P. Bigham. A readability evaluation of real-time crowd
captions in the classroom. In Proceedings of the 14th international ACM SIGACCESS conference
on Computers and accessibility, ASSETS ’12, pages 71–78, New York, NY, USA, 2012. ACM.
[99] T. Lambert and A. Schwienbacher. An empirical analysis of crowdfunding. Social Science Research
Network, 1578175, 2010.
[100] W. S. Lasecki, C. Homan, and J. P. Bigham. Architecting real-time crowd-powered systems.
[101] W. S. Lasecki, A. Marcus, J. M. Rzeszotarski, and J. P. Bigham. Using Microtask Continuity to
Improve Crowdsourcing. In Carnegie Mellon University Human-Computer Interaction Institute -
Technical Reports - CMU-HCII-14-100, 2014.
[102] W. S. Lasecki, R. Wesley, J. Nichols, A. Kulkarni, J. F. Allen, and J. P. Bigham. Chorus: A Crowd-
powered Conversational Assistant. In Proceedings of the 26th Annual ACM Symposium on User
Interface Software and Technology, UIST ’13, pages 151–162. ACM, 2013.
[103] S. Lazebnik, C. Schmid, J. Ponce, et al. Semi-local affine parts for object recognition. In British
Machine Vision Conference (BMVC’04), pages 779–788, 2004.
[104] J. Le, A. Edmonds, V. Hester, and L. Biewald. Ensuring quality in crowdsourced search rele-
vance evaluation: The effects of training question distribution. In SIGIR 2010 workshop on
crowdsourcing for search evaluation, pages 21–26, 2010.
[105] G. Lee, B.-G. Chun, and H. Katz. Heterogeneity-aware resource allocation and scheduling
in the cloud. In Proceedings of the 3rd USENIX conference on Hot topics in cloud computing,
HotCloud’11, pages 4–4, Berkeley, CA, USA, 2011. USENIX Association.
[106] V. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet
Physics Doklady, volume 10, pages 707–710, 1966.
[107] E.-P. Lim, J. Srivastava, S. Prabhakar, and J. Richardson. Entity identification in database integra-
tion. In Data Engineering, 1993. Proceedings. Ninth International Conference on, pages 294–301.
IEEE, 1993.
[108] G. Little, L. B. Chilton, M. Goldman, and R. C. Miller. Turkit: tools for iterative tasks on mechanical
turk. In Proceedings of the ACM SIGKDD workshop on human computation, pages 29–30. ACM,
2009.
[109] C. Lofi, K. El Maarry, and W.-T. Balke. Skyline queries in crowd-enabled databases. In Proceedings
of the 16th International Conference on Extending Database Technology, EDBT ’13, pages 465–476,
New York, NY, USA, 2013. ACM.
[110] C. Macdonald and I. Ounis. Voting techniques for expert search. Knowl. Inf. Syst., 16(3):259–280,
2008.
[111] A. Mahmood, W. G. Aref, E. Dragut, and S. Basalamah. The palm-tree index: Indexing with the
crowd. 2013.
[112] A. Mao, E. Kamar, Y. Chen, E. Horvitz, M. E. Schwamb, C. J. Lintott, and A. M. Smith. Volunteering
Versus Work for Pay: Incentives and Tradeoffs in Crowdsourcing. In HCOMP, 2013.
[113] A. Mao, E. Kamar, and E. Horvitz. Why Stop Now? Predicting Worker Engagement in Online
Crowdsourcing. In First AAAI Conference on Human Computation and Crowdsourcing, 2013.
[114] A. Marcus et al. Optimization techniques for human computation-enabled data processing
systems. PhD thesis, Massachusetts Institute of Technology, 2012.
[115] A. Marcus, D. Karger, S. Madden, R. Miller, and S. Oh. Counting with the crowd. Proceedings of
the VLDB Endowment, 6(2):109–120, 2012.
[116] A. Marcus, E. Wu, D. Karger, S. Madden, and R. Miller. Human-powered sorts and joins. Proceed-
ings of the VLDB Endowment, 5(1):13–24, 2011.
[117] A. Marcus, E. Wu, D. R. Karger, S. Madden, and R. C. Miller. Crowdsourced databases: Query
processing with people. CIDR, 2011.
[118] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding Light on
the Web of Documents. In Proceedings of the 7th International Conference on Semantic Systems
(I-Semantics), 2011.
[119] E. Michaels, H. Handfield-Jones, and B. Axelrod. The war for talent. Harvard Business Press,
2001.
[120] R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. In Proceed-
ings of the sixteenth ACM conference on Conference on information and knowledge management,
CIKM ’07, pages 233–242, New York, NY, USA, 2007. ACM.
[121] P. Minder and A. Bernstein. Crowdlang: a programming language for the systematic exploration
of human computation systems. In Proceedings of the 4th international conference on Social
Informatics, SocInfo’12, pages 124–137, Berlin, Heidelberg, 2012. Springer-Verlag.
[122] J. Mortensen, M. A. Musen, and N. F. Noy. Crowdsourcing the verification of relationships in
biomedical ontologies. In AMIA, 2013.
[123] B. Mozafari, P. Sarkar, M. Franklin, M. Jordan, and S. Madden. Scaling up crowd-sourcing to very
large datasets: A case for active learning. Proceedings of the VLDB Endowment, 8(2), 2014.
[124] C. Nieke, U. Güntzer, and W.-T. Balke. Topcrowd. In Conceptual Modeling, pages 122–135.
Springer, 2014.
[125] V. Nunia, B. Kakadiya, C. Hota, and M. Rajarajan. Adaptive Task Scheduling in Service Oriented
Crowd Using SLURM. In ICDCIT, pages 373–385, 2013.
[126] B. On, N. Koudas, D. Lee, and D. Srivastava. Group linkage. In Data Engineering, 2007. ICDE
2007. IEEE 23rd International Conference on, pages 496–505. IEEE, 2007.
[127] G. Papadakis, E. Ioannou, C. Niederée, T. Palpanas, and W. Nejdl. Beyond 100 million entities:
large-scale blocking-based resolution for heterogeneous data. In Proceedings of the fifth ACM
international conference on Web search and data mining, WSDM ’12, pages 53–62, New York, NY,
USA, 2012. ACM.
[128] A. Parameswaran and N. Polyzotis. Answering queries using databases, humans and algorithms.
In Conference on Innovative Data Systems Research, volume 160, 2011.
[129] A. G. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, and J. Widom. Crowd-
screen: Algorithms for filtering data with humans. In Proceedings of the 2012 ACM SIGMOD
International Conference on Management of Data, pages 361–372. ACM, 2012.
[130] S. Perugini, M. A. Gonçalves, and E. A. Fox. Recommender systems research: A connection-
centric survey. J. Intell. Inf. Syst., 23(2):107–143, Sept. 2004.
[131] V. Polychronopoulos, L. de Alfaro, J. Davis, H. Garcia-Molina, and N. Polyzotis. Human-powered
top-k lists. In WebDB, pages 25–30, 2013.
[132] J. Pöschko, M. Strohmaier, T. Tudorache, N. F. Noy, and M. A. Musen. Pragmatic analysis of
crowd-based knowledge production systems with icat analytics: Visualizing changes to the
icd-11 ontology. In AAAI Spring Symposium: Wisdom of the Crowd, 2012.
[133] J. Pound, P. Mika, and H. Zaragoza. Ad-hoc object retrieval in the web of data. In WWW, pages
771–780, 2010.
[134] V. Rajan, S. Bhattacharya, L. E. Celis, D. Chander, K. Dasgupta, and S. Karanam. Crowdcontrol:
An online learning approach for optimal task scheduling in a dynamic crowd platform. In ICML
Workshop on ’Machine Learning meets Crowdsourcing’, 2013.
[135] V. C. Raykar, S. Yu, L. H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy.
Supervised learning from multiple experts: whom to trust when everyone lies a bit. In Proceedings
of the 26th Annual international conference on machine learning, pages 889–896. ACM, 2009.
[136] J. Ross, L. Irani, M. Silberman, A. Zaldivar, and B. Tomlinson. Who are the crowdworkers?: shifting
demographics in mechanical turk. In CHI’10 Extended Abstracts on Human Factors in Computing
Systems, pages 2863–2872. ACM, 2010.
[137] S. B. Roy, I. Lykourentzou, S. Thirumuruganathan, S. Amer-Yahia, and G. Das. Optimization in
knowledge-intensive crowdsourcing. CoRR, abs/1401.1302, 2014.
[138] J. M. Rzeszotarski, E. Chi, P. Paritosh, and P. Dai. Inserting micro-breaks into crowdsourcing
workflows. In HCOMP (Works in Progress / Demos), volume WS-13-18 of AAAI Workshops. AAAI,
2013.
[139] C. Sarasua, E. Simperl, and N. F. Noy. Crowdmap: Crowdsourcing ontology alignment with
microtasks. In ISWC, pages 525–541, 2012.
[140] N. Seemakurty, J. Chu, L. von Ahn, and A. Tomasic. Word sense disambiguation via human
computation. In Proceedings of the ACM SIGKDD Workshop on Human Computation, HCOMP
’10, pages 60–63. ACM, 2010.
[141] J. Selke, C. Lofi, and W.-T. Balke. Pushing the boundaries of crowd-enabled databases with
query-driven schema expansion. Proc. VLDB Endow., 5(6):538–549, Feb. 2012.
[142] A. D. Shaw, J. J. Horton, and D. L. Chen. Designing incentives for inexpert human raters. In
Proceedings of the ACM 2011 conference on Computer supported cooperative work, pages 275–284.
ACM, 2011.
[143] W. Shen, J. Wang, P. Luo, and M. Wang. Liege:: link entities in web lists with knowledge base. In
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data
mining, KDD ’12, pages 1424–1432, New York, NY, USA, 2012. ACM.
[144] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? improving data quality and data
mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 614–622. ACM, 2008.
[145] A. Sheshadri and M. Lease. SQUARE: A Benchmark for Research on Computing Crowd Consensus.
In Proceedings of the 1st AAAI Conference on Human Computation (HCOMP), 2013.
[146] M. S. Silberman, L. Irani, and J. Ross. Ethics and tactics of professional crowdwork. XRDS,
17(2):39–43, Dec. 2010.
[147] Y. Singer and M. Mittal. Pricing Mechanisms for Crowdsourcing Markets. In Proceedings of
the 22Nd International Conference on World Wide Web, WWW ’13, pages 1157–1166, Republic
and Canton of Geneva, Switzerland, 2013. International World Wide Web Conferences Steering
Committee.
[148] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective
labelling of venus images. Advances in neural information processing systems, pages 1085–1092,
1995.
[149] M. Stonebraker. What does ‘big data’ mean? Communications of the ACM, BLOG@CACM, 2012.
[150] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. Mapreduce
and parallel dbmss: friends or foes? Communications of the ACM, 53(1):64–71, 2010.
[151] A. Tonon, G. Demartini, and P. Cudre-Mauroux. Combining inverted indices and structured
search for ad-hoc object retrieval. In SIGIR, pages 125–134, 2012.
[152] B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In
Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 673–684. IEEE, 2013.
[153] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe,
H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler. Apache
hadoop yarn: Yet another resource negotiator. In SOCC ’13, pages 5:1–5:16. ACM, 2013.
[154] P. Venetis, H. Garcia-Molina, K. Huang, and N. Polyzotis. Max algorithms in crowdsourcing
environments. In Proceedings of the 21st international conference on World Wide Web, pages
989–998. ACM, 2012.
[155] K. Vertanen and P. O. Kristensson. A versatile dataset for text entry evaluations based on gen-
uine mobile emails. In Proceedings of the 13th International Conference on Human Computer
Interaction with Mobile Devices and Services, pages 295–298. ACM, 2011.
[156] L. von Ahn. Games with a purpose. Computer, 39(6):92–94, 2006.
[157] L. von Ahn. Human computation. In Design Automation Conference, 2009. DAC’09. 46th
ACM/IEEE, pages 418–419. IEEE, 2009.
[158] L. von Ahn and L. Dabbish. Labeling images with a computer game. In CHI ’04, pages 319–326.
ACM, 2004.
[159] L. von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum. reCAPTCHA: Human-based
character recognition via web security measures. Science, 321(5895):1465–1468, 2008.
[160] M. Vukovic and A. Natarajan. Operational Excellence in IT Services Using Enterprise Crowd-
sourcing. In IEEE SCC, pages 494–501, 2013.
[161] J. Wang, S. Faridani, and P. Ipeirotis. Estimating the completion time of crowdsourced tasks using
survival analysis models. Crowdsourcing for search and data mining (CSDM 2011), 31, 2011.
[162] J. Wang, P. G. Ipeirotis, and F. Provost. Managing crowdsourcing workers. In The 2011 Winter
Conference on Business Intelligence, pages 10–12, 2011.
[163] J. Wang, P. G. Ipeirotis, and F. Provost. Quality-Based Pricing for Crowdsourced Workers. In NYU
Stern Research Working Paper - CBA-13-06, 2013.
[164] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution.
Proceedings of the VLDB Endowment, 5(11):1483–1494, 2012.
[165] J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. Leveraging transitive relations for crowd-
sourced joins. In Proceedings of the 2013 international conference on Management of data, pages
229–240. ACM, 2013.
[166] P. Welinder and P. Perona. Online crowdsourcing: rating annotators and obtaining cost-effective
labels. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer
Society Conference on, pages 25–32. IEEE, 2010.
[167] S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. Entity resolution
with iterative blocking. In Proceedings of the 2009 ACM SIGMOD International Conference on
Management of data, SIGMOD ’09, pages 219–232, New York, NY, USA, 2009. ACM.
[168] J. Whitehill, T.-f. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo. Whose vote should count
more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural
information processing systems, pages 2035–2043, 2009.
[169] W. Winkler. The state of record linkage and current research problems. In Statistical Research
Division, US Census Bureau, 1999.
[170] H. Yannakoudakis, T. Briscoe, and B. Medlock. A new dataset and method for automatically
grading esol texts. In Proceedings of the 49th Annual Meeting of the Association for Computa-
tional Linguistics: Human Language Technologies-Volume 1, pages 180–189. Association for
Computational Linguistics, 2011.
[171] M. Yin, Y. Chen, and Y.-A. Sun. Monetary Interventions in Crowdsourcing Task Switching. In
Proceedings of the 2nd AAAI Conference on Human Computation (HCOMP), 2014.
[172] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling:
a simple technique for achieving locality and fairness in cluster scheduling. In EuroSys ’10, pages
265–278. ACM, 2010.
[173] H. Zhang, E. Horvitz, Y. Chen, and D. C. Parkes. Task routing for prediction tasks. In Proceedings
of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 2,
pages 889–896. International Foundation for Autonomous Agents and Multiagent Systems, 2012.
Djellel Eddine Difallah
Address: Bd Perolles 90, Fribourg 1700, Switzerland.
Email: [email protected], Phone: +41 76 822 0296
Research and Interests
My research focuses on combining the intelligence of humans in solving complex problems with the scalability of machines in processing large amounts of data. In particular, I try to bridge the two worlds by creating solutions that efficiently manage crowd workers so that they deliver timely inputs to machine requests. My work is supported by the Swiss National Science Foundation.
Other interests: data management, distributed systems, big data challenges.
Education
2011–today PhD candidate @ University of Fribourg, Switzerland.
– Dissertation on “Quality of Service in Crowd-Powered Systems”.
2009–2011 MSc in Computer Science, University of Louisiana at Lafayette, USA.
– Fulbright Foreign Student Scholarship.
– Received honors for maintaining a GPA of 4.0 for four semesters.
1999–2004 Diploma of Engineer in Informatics, USTHB, Algeria.
Professional Experience
2011–today Research Assistant at the eXascale InfoLab.
– Main focus on dissertation-related projects (human computation).
– Contribute to other ongoing projects in the lab: Semantic Web (RDF/graph storage), memory-based information systems (MEM0R1ES), smart cities (stream processing), and array processing (SciDB).
– Teaching assistant for the social computing class.
– Supervise master’s students working on smart-city projects in collaboration with IBM Dublin.
Summer of 2013 Research Intern at Microsoft’s Cloud and Information Services Lab.
– Project: reservation-based scheduling with Hadoop YARN (YARN-1051).
– The work resulted in a paper to appear at the fifth ACM Symposium on Cloud Computing, 2014.
Summer of 2010 Student Developer in the Google Summer of Code program.
– Project: a query-cache plugin for the Drizzle DBMS based on memcached.
2006–2009 Information Management Engineer at Schlumberger.
– On-site client support for the data management software provided by the company.
– Reporting and SQL tuning.
2005–2006 Engineer at EEPAD, an Internet service provider.
– In charge of the authentication platform (RADIUS) for both line and wireless clients.
– Developed an automatic provisioning solution to synchronize the information system and the deployed ADSL hardware using SNMP.
Opensource Projects
Lead OLTPBench, Openturk Chrome extension.
Contributor Apache Hadoop YARN, Apache Mahout, SciDB, Drizzle.
Relevant Computer Skills
Programming Java, C++, Python, JavaScript.
DBMS MySQL, Postgres.
Languages
French, English, Arabic.
Publications
2015 Djellel E. Difallah, Michele Catasta, Gianluca Demartini, Panagiotis G. Ipeirotis, and Philippe Cudre-Mauroux. The dynamics of micro-task crowdsourcing: the case of Amazon MTurk. In WWW, Florence, Italy, 2015.
Dana Van Aken, Djellel E. Difallah, Andrew Pavlo, Carlo Curino, and Philippe Cudre-Mauroux. BenchPress: dynamic workload control in the OLTP-Bench testbed. In SIGMOD, Melbourne, Australia, 2015. ACM.
2014 Djellel Eddine Difallah, Michele Catasta, Gianluca Demartini, and Philippe Cudre-Mauroux. Scaling-up the crowd: micro-task pricing schemes for worker retention and latency improvement. In Second AAAI Conference on Human Computation and Crowdsourcing, 2014.
Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer, and Philippe Cudre-Mauroux. TransactiveDB: tapping into collective human memories. Proceedings of the VLDB Endowment, 2013.
Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer, and Philippe Cudre-Mauroux. Hippocampus: answering memory queries using transactive search. In Proceedings of the companion publication of the 23rd International Conference on World Wide Web, pages 535–540. International World Wide Web Conferences Steering Committee, 2014.
2013 Djellel Eddine Difallah, Gianluca Demartini, and Philippe Cudre-Mauroux. Pick-a-crowd: tell me what you like, and I’ll tell you what to do. In Proceedings of the 22nd International Conference on World Wide Web, pages 367–374. International World Wide Web Conferences Steering Committee, 2013.
Djellel Eddine Difallah, Philippe Cudre-Mauroux, and S. McKenna. Scalable anomaly detection for smart city infrastructure networks. 2013.
Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudre-Mauroux. Large-scale linked data integration using probabilistic reasoning and crowdsourcing. The VLDB Journal, 22(5):665–687, 2013.
Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino, and Philippe Cudre-Mauroux. OLTP-Bench: an extensible testbed for benchmarking relational databases. Proceedings of the VLDB Endowment, 7(4), 2013.
2012 G. Demartini, D.E. Difallah, and P. Cudre-Mauroux. ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In Proceedings of the 21st International Conference on World Wide Web, pages 469–478. ACM, 2012.
D.E. Difallah, G. Demartini, and P. Cudre-Mauroux. Mechanical Cheat: spamming schemes and adversarial techniques on crowdsourcing platforms. CrowdSearch 2012 Workshop at WWW, pages 26–30, 2012.
2011 P. Cudre-Mauroux, G. Demartini, D.E. Difallah, A.E. Mostafa, V. Russo, and M. Thomas. A Demonstration of DNS3: a Semantic-Aware DNS Service. ISWC, 2011.
D.E. Difallah, R.G. Benton, V. Raghavan, and T. Johnsten. FAARM: frequent association action rules mining using FP-tree. In Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, pages 398–404. IEEE, 2011.