Crowdsourcing the Semantic Web
Elena Simperl
University of Southampton
Seminar@University of Manchester 4 March 2015
Crowdsourcing Web semantics
• Crowdsourcing is increasingly used to augment the results of algorithms solving Semantic Web problems
• Great challenges
• Which form of crowdsourcing for which task?
• How to design the crowdsourcing exercise effectively?
• How to combine human- and machine-driven approaches?
Typology of crowdsourcing
• Macrotasks
• Microtasks
• Challenges
• Self-organized crowds
• Crowdfunding
Source: (Prpić, Shukla, Kietzmann and McCarthy 2015)
Dimensions of crowdsourcing
• What to crowdsource?
– Tasks you can’t complete in-house or using computers
– A question of time, budget, resources, ethics etc.
• Who is the crowd?
– Crowdsourcing ≠ ‘turkers’
– Open call, biased through use of platforms and promotion channels
– No traditional means to manage and incentivize
– Crowd has little to no context about the project
• How to crowdsource?
– Macro vs. microtasks
– Complex workflows
– Assessment and aggregation
– Aligning incentives
– Hybrid systems
– Learn to optimize from previous interactions
Hybrid systems (or ‘social machines’)
Figure: hybrid systems couple the physical world (people and devices) with a virtual world (a network of social interactions) through design and composition, participation and data supply, and a model of social interaction. Source: Dave Robertson
CROWDSOURCING DATA CURATION
Crowdsourcing Linked Data quality assessment. M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, S. Auer, J. Lehmann. ISWC 2013, pp. 260-276
Overview
• Compare two forms of crowdsourcing (challenges and paid microtasks) in the context of data quality assessment and repair
• Experiments on DBpedia
– Identify potential errors, classify, and repair them
– TripleCheckMate challenge vs. Mechanical Turk
What to crowdsource
• Incorrect object
Example: dbpedia:Dave_Dobbyn dbprop:dateOfBirth “3”.
• Incorrect data type or language tags
Example: dbpedia:Torishima_Izu_Islands foaf:name “鳥島”@en.
• Incorrect link to “external Web pages”
Example: dbpedia:John-Two-Hawks dbpedia-owl:wikiPageExternalLink
<http://cedarlakedvd.com/>
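These error types can be pre-selected automatically before handing triples to the crowd. Below is a minimal sketch, assuming the SPARQLWrapper Python library and the public DBpedia endpoint; the query and the length heuristic are illustrative, not the selection procedure used in the study.

# Sketch: pull candidate "incorrect object" triples from DBpedia for crowd review.
# The heuristic (birth-date literals shorter than four characters) is only an illustration.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    PREFIX dbprop: <http://dbpedia.org/property/>
    SELECT ?person ?dob WHERE {
        ?person dbprop:dateOfBirth ?dob .
        FILTER (isLiteral(?dob) && strlen(str(?dob)) < 4)
    } LIMIT 100
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    # Each hit becomes one verification microtask.
    print(row["person"]["value"], row["dob"]["value"])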
Who is the crowd
• Find stage: challenge aimed at LD experts (difficult task, final prize), run with TripleCheckMate [Kontokostas2013]
• Verify stage: microtasks for paid workers (easy task, micropayments), run on MTurk (http://mturk.com)
• Find-Verify workflow adapted from [Bernstein2010]
How to crowdsource: microtask design
• Selection of foaf:name or rdfs:label to extract human-readable descriptions
• Values extracted automatically from Wikipedia infoboxes
• Link to the Wikipedia article via foaf:isPrimaryTopicOf
• Preview of external pages embedded via an HTML iframe
• One interface per error type: incorrect object, incorrect data type or language tag, incorrect outlink (see the sketch below)
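A minimal sketch of how such task units could be assembled from triples into a microtask input file; the input tuples, column names, and the label/URL heuristics are assumptions for illustration, not the generation pipeline used in the paper.

# Sketch: build one verification task row per triple, with a human-readable
# label and a link back to the Wikipedia article for context.
import csv

triples = [
    ("dbpedia:Dave_Dobbyn", "dbprop:dateOfBirth", '"3"'),
]

def task_row(subject, predicate, obj):
    name = subject.split(":")[-1].replace("_", " ")   # stand-in for foaf:name / rdfs:label
    wiki = "https://en.wikipedia.org/wiki/" + subject.split(":")[-1]  # via foaf:isPrimaryTopicOf
    return {"subject_label": name, "predicate": predicate, "object": obj, "wikipedia_url": wiki}

with open("verify_tasks.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["subject_label", "predicate", "object", "wikipedia_url"])
    writer.writeheader()
    for s, p, o in triples:
        writer.writerow(task_row(s, p, o))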
Experiments
• Crowdsourcing approaches
• Find stage: Challenge with LD experts
• Verify stage: Microtasks (5 assignments)
• Gold standard
• Two of the authors generated the gold standard for all contest triples independently
• Conflicts resolved via mutual agreement
• Validation: precision
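A minimal sketch of the aggregation and validation step: five judgements per triple are reduced by majority voting and compared against the gold standard. The data structures are hypothetical, and precision is computed here in a simple way that may differ from the paper's exact definition.

# Sketch: aggregate n=5 worker judgements per triple by majority voting,
# then compute precision against a gold standard. Data is illustrative.
from collections import Counter

def majority_vote(judgements):
    # judgements: list of 'correct' / 'incorrect' labels for one triple
    return Counter(judgements).most_common(1)[0][0]

def precision(crowd_labels, gold_labels):
    # fraction of triples flagged 'incorrect' that the gold standard also marks incorrect
    flagged = [t for t, label in crowd_labels.items() if label == "incorrect"]
    if not flagged:
        return 0.0
    true_positives = sum(1 for t in flagged if gold_labels[t] == "incorrect")
    return true_positives / len(flagged)

raw = {"t1": ["incorrect"] * 4 + ["correct"], "t2": ["correct"] * 3 + ["incorrect"] * 2}
crowd = {t: majority_vote(j) for t, j in raw.items()}
print(precision(crowd, {"t1": "incorrect", "t2": "correct"}))  # -> 1.0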
Overall results (LD experts vs. microtask workers)
• Number of distinct participants: 50 vs. 80
• Total time: 3 weeks (predefined) vs. 4 days
• Total triples evaluated: 1,512 vs. 1,073
• Total cost: ~US$ 400 (predefined) vs. ~US$ 43
Precision results: Incorrect objects
• Turkers can reduce the error rates of LD experts from the Find stage
• 117 DBpedia triples had predicates related to dates with incorrect/incomplete values
  Example: ”2005 Six Nations Championship” Date 12 .
• 52 DBpedia triples had erroneous values from the source
  Example: ”English (programming language)” Influenced by ? .
• Experts classified all of these triples as incorrect
• Workers compared the values against Wikipedia and successfully classified these triples as “correct”
• Triples compared: 509; precision LD experts: 0.7151; MTurk (majority voting, n=5): 0.8977
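To make "incomplete value" concrete: a literal such as 12 on a date property fails simple lexical validation against xsd:date. A minimal, assumed check; this is not the classification procedure used in the study.

# Sketch: flag date-typed literals whose lexical form is not a full xsd:date.
import re

XSD_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def is_complete_date(lexical_value):
    return bool(XSD_DATE.match(lexical_value))

print(is_complete_date("12"))          # False -> candidate for crowd verification
print(is_complete_date("2005-02-05"))  # True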
Precision results: Incorrect data types
• Figure: number of triples (true and false positives, experts vs. crowd) per data type: Date, English, Millimetre, Nanometre, Number, Number with decimals, Second, Volt, Year, Not specified/URI
• Triples compared: 341; precision LD experts: 0.8270; MTurk (majority voting, n=5): 0.4752
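Since data types turned out to be the hardest error class for the crowd, part of this check can be automated by testing whether a literal's lexical form actually parses under its declared XSD datatype. A minimal sketch using rdflib, which is an assumed tooling choice rather than the pipeline used in the paper.

# Sketch: check that a literal's lexical form is valid for its declared datatype.
# rdflib parses well-formed typed literals into Python values; for ill-typed ones
# toPython() falls back to the string form, which we use as a cheap validity signal.
from rdflib import Literal, XSD

def datatype_ok(lexical_value, datatype):
    lit = Literal(lexical_value, datatype=datatype)
    return not isinstance(lit.toPython(), str) or datatype == XSD.string

print(datatype_ok("3.5", XSD.decimal))   # True
print(datatype_ok("abc", XSD.integer))   # False -> flag for review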
Precision results: Incorrect links
• We analyzed the 189 misclassifications made by the experts
  Figure: breakdown of these misclassifications: Freebase links (50%), Wikipedia images (39%), external links (11%)
• The 6% of misclassifications made by the workers correspond to pages in a language other than English
• Triples compared: 223; precision baseline: 0.2598; LD experts: 0.1525; MTurk (majority voting, n=5): 0.9412
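An automatic pre-check could catch both failure modes seen here: unreachable pages and pages not served in English. A minimal sketch using only the Python standard library; the decision labels are assumptions for illustration.

# Sketch: pre-check external links before crowdsourcing. Flags unreachable links
# and, where the server declares it, pages not served in English.
from urllib.request import Request, urlopen
from urllib.error import URLError

def check_external_link(url):
    try:
        response = urlopen(Request(url, method="HEAD"), timeout=10)
    except URLError:
        return "unreachable"
    language = response.headers.get("Content-Language", "")
    if language and not language.lower().startswith("en"):
        return "non-english"   # the case workers tended to misjudge
    return "ok"

print(check_external_link("http://cedarlakedvd.com/"))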
Conclusions
• The effort of LD experts should be invested in tasks that demand domain-specific skills
– Experts seem less motivated to perform tasks that come with an additional overhead (e.g., checking an external link, going back to Wikipedia, data repair)
• The MTurk crowd was exceptionally good at object value and link checks
– But workers seem not to understand the intricacies of data types
Future work
• Additional experiments with data types
• Workflows for new types of errors, e.g., tasks which experts cannot answer on the fly
• Training the crowd
• Building the DBpedia ‘social machine’
– Add automatic QA tools
– Close the feedback loop (from crowd to experts)
– Change the parameters of the ‘Find’ step to achieve seamless integration
– Change crowdsourcing model of ‘Find’ to improve retention
CROWDSOURCING IMAGE ANNOTATION
Improving paid microtasks through gamification and adaptive furtherance incentives. O. Feyisetan, E. Simperl, M. Van Kleek, N. Shadbolt. WWW 2015, to appear
Overview
• Make paid microtasks more cost-effective through gamification*
*use of game elements and mechanics in a non-game context
Research hypotheses
• Contributors will perform better if tasks are more engaging
• Increased accuracy through higher inter-annotator agreement (a simple agreement measure is sketched after this list)
• Cost savings through reduced unit costs
• Micro-targeting incentives when players attempt to quit improves retention
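A minimal sketch of one way to measure inter-annotator agreement on free-text image labels: average pairwise Jaccard overlap of the annotators' label sets per image. This is an illustrative measure; the exact metric behind the agreement figures reported later may differ.

# Sketch: average pairwise Jaccard overlap of annotators' label sets for one image.
from itertools import combinations

def image_agreement(label_sets):
    # label_sets: one set of free-text labels per annotator
    pairs = list(combinations(label_sets, 2))
    if not pairs:
        return 0.0
    overlaps = [len(a & b) / len(a | b) for a, b in pairs if a | b]
    return sum(overlaps) / len(pairs)

print(image_agreement([{"dog", "grass"}, {"dog", "park"}, {"dog"}]))  # ~0.44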
How to crowdsource: microtask design
• Image labeling tasks published on microtask platform
– Free text labels, varying numbers of labels per image, taboo words
– 1st setting: ‘standard’ tasks, including basic spam control
– 2nd setting: the same requirements and rewards, but contributors were asked to carry out the task in WordSmith
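A minimal sketch of the kind of submission check implied by the taboo-word and spam-control requirements above; the taboo list and thresholds are assumptions for illustration, not the rules used in the experiments.

# Sketch: basic validation of a free-text labelling submission.
TABOO = {"image", "picture", "photo"}   # assumed per-task taboo words
MIN_LABELS = 2

def valid_submission(labels):
    cleaned = [label.strip().lower() for label in labels if label.strip()]
    if len(set(cleaned)) < MIN_LABELS:             # too few distinct labels
        return False
    if any(label in TABOO for label in cleaned):   # taboo words are disallowed
        return False
    if len(set(cleaned)) < len(cleaned):           # repeated labels look like spam
        return False
    return True

print(valid_submission(["dog", "grass"]))   # True
print(valid_submission(["image", "dog"]))   # False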
How to crowdsource: gamification
• Levels – 9 levels from newbie to Wordsmith, function of images tagged
• Badges – function of number of images tagged
• Bonus points – for new tags
• Treasure points – for multiples of bonus points
• Feedback alerts - related to badges, points, levels
• Leaderboard - hourly scores and level of the top 5 players
• Activities widget – real-time updates on other players
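A minimal sketch of how game state could be derived from activity, mirroring the elements above (levels as a function of images tagged, bonus points for new tags, treasure points at multiples of bonus points). All thresholds and point values are assumptions, not Wordsmith's actual parameters.

# Sketch: derive level and score from a player's activity. Values are illustrative.
LEVEL_THRESHOLDS = [0, 5, 15, 30, 60, 100, 150, 250, 400]   # 9 levels, newbie to Wordsmith

def level(images_tagged):
    return sum(1 for threshold in LEVEL_THRESHOLDS if images_tagged >= threshold)

def score(images_tagged, new_tags):
    base = images_tagged * 10
    bonus = new_tags * 5                 # bonus points for tags nobody used before
    treasure = (bonus // 50) * 25        # treasure points at multiples of bonus points
    return base + bonus + treasure

print(level(12), score(12, 7))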
How to crowdsource: incentives
• Money – 5 cents extra for the effort
• Power – see how other players tagged the images
• Leaderboard - advance to the ‘Global’ leaderboard seen by everyone
• Badges – receive ’Ultimate’ badge and a shiny new avatar
• Levels – advance straight to the next level
• Access - quicker access to treasure points
Furtherance incentives were assigned at random or targeted using Bayesian inference (number of images tagged as the feature); see the sketch below
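A minimal sketch of such targeting: estimate the probability that a player with a given number of tagged images is about to quit, and only offer the furtherance incentive above a threshold. The historical counts, smoothing, and threshold are illustrative assumptions, not the model used in the paper.

# Sketch: target a furtherance incentive with a simple Bayesian quit estimate,
# using the number of images tagged so far as the only feature.
from collections import defaultdict

# Hypothetical historical counts: players who quit after n images vs. continued past n.
quit_counts = defaultdict(int, {1: 120, 2: 60, 11: 80})
stay_counts = defaultdict(int, {1: 200, 2: 180, 11: 40})

def p_quit(images_tagged):
    q = quit_counts[images_tagged] + 1   # Laplace smoothing
    s = stay_counts[images_tagged] + 1
    return q / (q + s)

def offer_incentive(images_tagged, threshold=0.5):
    return p_quit(images_tagged) > threshold   # only spend the incentive on likely quitters

print(offer_incentive(11))   # True with the toy counts above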
Experiments
• Crowdsourcing approaches
– Paid microtasks
– Wordsmith GWAP
• Data set: ESP Game (120K images with curated labels)
• Validation: precision
Experiments (2)
• 1st experiment: ‘newcomers’
– Tag one image with two keywords ~ complete Wordsmith level 1
– 200 images, US$ 0.02 per task, 3 judgements per task
• 2nd experiment: ‘novice’
– Tag 11 images ~ complete Wordsmith level 2
– 2200 images, US$ 0.10 per task, 600 workers
– Furtherance incentives: none, random, targeted (Bayes inference based on # of images tagged)
Results: 1st experiment (CrowdFlower vs. Wordsmith)
• Total workers: 600 vs. 423
• Total keywords: 1,200 vs. 41,206
• Unique keywords: 111 vs. 5,708
• Avg. agreement: 5.72% vs. 37.7%
• Mean images/person: 1 vs. 32
• Max images/person: 1 vs. 200
Results: 2nd experiment (CrowdFlower vs. Wordsmith without furtherance incentives)
• Total workers: 600 vs. 514
• Total keywords: 13,200 vs. 35,890
• Unique keywords: 1,323 vs. 4,091
• Avg. agreement: 6.32% vs. 10.9%
• Mean images/person: 11 vs. 27
• Max images/person: 1 vs. 351
Results: 2nd experiment, incentives
• Incentives led to increased participation
– Power incentive: mean images/person = 132
– People come back (20 times!) and play longer (43 hours vs. 3 hours without incentives)
Conclusions and future work
• Task design matters as much as payment
• Gamification achieves high accuracy at lower cost and with improved engagement
• Workers appreciate social features, but their main motivation is still task-driven
– Feedback mechanisms, peer learning, training